ACM MM 2022 Video Understanding Challenge: technical sharing from the AutoX team, champion of the video classification track
2022-07-01 18:51:00 【Zhiyuan community】
ACM Multimedia (ACM MM), first held in 1993, is the leading international venue for academic and industry exchange in the multimedia field, and it is the only multimedia conference rated Class A by the China Computer Federation (CCF). The Pre-training for Video Understanding Challenge is one of the major competitions it hosts.
In this competition, the AutoX team from 4Paradigm used a new temporal multi-scale pre-training scheme for video classification and took first place in the video classification track by a clear margin.
Introduction to the task
In recent years, with the rise of short video, hundreds of millions of multimedia videos have accumulated on the Internet. These videos often carry only weak labels such as a topic or a category, with high label noise and a wide span of categories. Although recent advances in computer vision have brought great success in video classification, video captioning, video object detection, and related fields, how to make effective use of the large number of unlabeled or weakly labeled videos on the Internet is still an open research question. The Pre-training for Video Understanding Challenge aims to promote research on video pre-training and to encourage research teams to design new pre-training techniques that improve a range of downstream tasks.
In this article we focus on the video classification track. The competition provides YOVO-3M, a pre-training dataset of 3 million videos crawled from YouTube. Each video comes with its YouTube title and a query that serves as the video category (such as bowling, archery, tigher cat). It also provides YOVO-downstream, a downstream task dataset of roughly 100,000 videos, made up of a training set of 70173 videos, a validation set of 16439 videos, and a test set of 16554 videos. These videos are divided into 240 predefined categories, including objects (such as Aircraft, Pizza, Football) and human actions (such as Waggle, High jump, Riding).
In this track, given the YouTube videos in YOVO-3M and their corresponding queries and titles, the goal of the contestants is to learn a general video representation through pre-training that can then be used to improve the downstream video classification task. Contestants are required to develop a video classification system using the YOVO-3M dataset (as training data) and the released YOVO downstream dataset (as training data for the downstream task). Systems are evaluated by top-1 accuracy on the downstream task dataset. There is no restriction on the use of external datasets.
query: brushing
title: Disney Jr Puppy Dog Pals Morning Routine Brushing Teeth, Taking a Bath, and Eating Breakfast!
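The top-1 accuracy metric mentioned above can be illustrated with a minimal sketch. The function name `top1_accuracy` and the toy scores are hypothetical and are not taken from the competition's evaluation code; the sketch only shows the standard definition: the fraction of videos whose highest-scoring class equals the ground-truth label.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class matches the label.

    scores: list of per-class score lists, one list per video (hypothetical input).
    labels: list of integer class ids, one per video.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the class with the largest score for this video.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy example with 3 videos and 2 classes: two of the three are classified correctly.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]
accuracy = top1_accuracy(toy_scores, toy_labels)
```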
Solution
We developed a "multi-temporal-resolution ensemble" technique that improves both pre-training and downstream task performance through ensemble learning, and we combined seven different network architectures to learn different video representations. Below, we introduce the multi-temporal-resolution ensemble technique proposed by the team and briefly describe the network architectures we used in the competition.
2.1 Ensemble on Multiple Temporal Resolutions
Ensemble learning can significantly improve model performance on a wide range of tasks. A core requirement of variance-reduction ensemble methods is that the base learners learn different knowledge from the data, so that the consensus among them improves final generalization. Bagging [13] is one of the representative algorithms. We start from the idea of Bagging, but instead of building training subsets by random sampling as in the original algorithm, we sample the videos at different temporal sampling rates to obtain training sets with different temporal resolutions, and train a different base learner on each. Experiments show that our method significantly improves the ensemble. In addition, because every base learner can use all of the training videos and therefore reaches higher single-model performance, our method also outperforms the traditional Bagging ensemble strategy.
▲ Fusion with multiple temporal resolutions
▲ Ensemble experiments
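As a rough sketch of the idea in this section, the snippet below builds several views of the same video by sampling frame indices at different temporal strides, so that each base learner would see every training video but at its own temporal resolution. The clip length of 8 frames, the strides 2, 4, and 8, and the function names are assumptions for illustration only; the article does not state the team's actual sampling settings.

```python
def sample_frame_indices(num_frames, clip_len, stride):
    """Pick `clip_len` frame indices spaced `stride` frames apart.

    The clip is centered in the video; indices that would run past the
    end are clamped to the last frame (an assumed, simple padding rule).
    """
    span = (clip_len - 1) * stride + 1          # frames covered by the clip
    start = max(0, (num_frames - span) // 2)    # center the clip when possible
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


def multi_resolution_views(num_frames, clip_len=8, strides=(2, 4, 8)):
    """Return one list of frame indices per temporal stride (hypothetical strides)."""
    return {s: sample_frame_indices(num_frames, clip_len, s) for s in strides}


# Each stride yields a different temporal resolution of the same 100-frame video;
# a separate base learner would be trained on each view.
views = multi_resolution_views(num_frames=100)
for stride, indices in views.items():
    print(f"stride={stride}: {indices}")
```

Unlike Bagging's random subsets, every view here covers the whole training set, which matches the article's point that each base learner can use all training videos.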
2.2 Backbones
We tested five networks: Temporal Segment Network [10,11], TimeSformer [2], BEiT [1], Swin Transformer [7], and Video Swin Transformer [8], covering both frame-based and spatiotemporal architectures. In our experiments, Video Swin Transformer achieved the best results. We also compared the computational complexity of the different network architectures.
In the final submission, we ensembled models with seven different network architectures, different pre-training datasets, and different sampling rates. This gave our best test-set top-1 accuracy of 62.39 and first place in the video classification track of the competition.
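One common way to combine such models is to average their per-class probabilities and take the top class. The sketch below assumes this weighted-softmax-averaging rule; the article does not specify the fusion rule the team actually used, and `ensemble_predict` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Average class probabilities across models for a single video.

    per_model_logits: one list of class logits per model (hypothetical input).
    weights: optional per-model weights that sum to 1; uniform by default.
    Returns (predicted_class, averaged_probabilities).
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(per_model_logits[0])
    averaged = [0.0] * n_classes
    for weight, logits in zip(weights, per_model_logits):
        for cls, prob in enumerate(softmax(logits)):
            averaged[cls] += weight * prob
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Three toy models (for example, different backbones or sampling rates)
# voting on a 3-class problem.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.5, 0.0]]
predicted_class, averaged_probs = ensemble_predict(toy_logits)
```

Averaging probabilities rather than raw logits keeps models with differently scaled outputs comparable, which matters when the ensemble mixes different architectures.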
Summary
In the ACM Multimedia 2022 video understanding challenge, we used an ensemble strategy based on multiple temporal sampling rates, combined with a variety of network architectures and pre-training datasets, and took first place in the video classification track. The approach offers a new direction for video understanding and pre-training.