ACM MM 2022 Video Understanding Challenge: Technical Sharing from the AutoX Team, Champion of the Video Classification Track
2022-07-01 18:51:00 [Zhiyuan Community]
ACM Multimedia (ACM MM), first held in 1993, is the premier venue for academic and industrial exchange in the international multimedia field, and the only international academic conference in multimedia recommended as Class A by the China Computer Federation. The Pre-training For Video Understanding Challenge is one of the important competitions it hosts.
In this competition, the 4Paradigm AutoX team used a new temporal multi-scale pre-training scheme for video classification and won first place in the video classification track by a clear margin.

Task Introduction
In recent years, with the rise of short video, hundreds of millions of multimedia videos have appeared on the Internet. These videos usually carry only weak labels such as topics and categories, and are characterized by high label noise and large category spans. Although recent advances in computer vision have achieved great success in video classification, video-text, video object detection and other fields, how to effectively exploit the large amount of unlabeled or weakly labeled video widely available on the Internet is still a topic worth studying. The Pre-training For Video Understanding Challenge aims to promote research on video pre-training techniques and encourages research teams to design new pre-training methods that improve a range of downstream tasks.
In this article we focus on the video classification track. The competition provides YOVO-3M, a pre-training dataset of 3 million videos crawled from YouTube; each video comes with its YouTube title and a query that serves as its video category (such as bowling, archery, tiger cat). It also provides YOVO-downstream, a downstream-task dataset of 100,000 videos, consisting of a training set of 70,173 videos, a validation set of 16,439 videos and a test set of 16,554 videos. These videos are divided into 240 predefined categories, covering both objects (such as Aircraft, Pizza, Football) and human actions (such as Waggle, High jump, Riding).
In this track, building on the YouTube videos and the corresponding queries and titles in YOVO-3M, the contestants' goal is to obtain, through pre-training, a general video representation that can be further used to boost the downstream video classification task. Contestants are required to develop a video classification system using the YOVO-3M dataset (as training data) and the published YOVO-downstream dataset (as training data for the downstream task). The final system is evaluated by its top-1 accuracy on the downstream test set. There is no restriction on the use of external datasets.

▲ An example from the YOVO-3M dataset (query: brushing; title: Disney Jr Puppy Dog Pals Morning Routine Brushing Teeth, Taking a Bath, and Eating Breakfast!)
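As a rough illustration of how such weakly labeled data can feed a classification-style pre-training objective, the sketch below represents a YOVO-3M-style record and maps each query string to an integer class index; the field names and the dataclass itself are hypothetical, not the organizers' actual data format.

```python
from dataclasses import dataclass

@dataclass
class WeaklyLabeledVideo:
    """Hypothetical YOVO-3M-style record: the query doubles as a noisy class label."""
    video_path: str   # path to the downloaded YouTube video
    title: str        # YouTube title of the video
    query: str        # search query used to crawl the video, e.g. "brushing"

def build_label_map(records):
    """Map each distinct query string to an integer class index for pre-training."""
    queries = sorted({r.query for r in records})
    return {q: i for i, q in enumerate(queries)}

# Usage sketch
records = [
    WeaklyLabeledVideo("videos/0001.mp4",
                       "Disney Jr Puppy Dog Pals Morning Routine Brushing Teeth, "
                       "Taking a Bath, and Eating Breakfast!",
                       "brushing"),
]
label_map = build_label_map(records)   # {"brushing": 0, ...}
```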
Solution
We developed a multi-temporal-resolution ensemble technique that improves both pre-training and downstream performance through ensemble learning, and additionally ensembled seven different network structures to learn different video representations. In the following sections, we introduce the multi-temporal-resolution ensemble technique proposed by the team and briefly describe the network structures we used in the competition.
2.1 Ensemble on Multiple Temporal Resolutions
Ensemble learning can significantly improve model performance on a wide range of tasks. One of its core ideas for variance reduction is that different base learners should learn different knowledge from the data, so that the final generalization performance can be improved by combining them; Bagging [13] is one of the representative algorithms. Starting from the idea of Bagging, but instead of building training subsets by random sampling as in the original algorithm, we sample each video at different temporal sampling rates to obtain training sets with different temporal resolutions, and use them to train different base learners. Experiments show that our method significantly improves the ensemble effect. Moreover, since every base learner can use all the training videos and thus achieves higher single-model performance, our method also outperforms the traditional Bagging ensemble strategy.

▲ Fusion with multiple temporal resolutions

▲ Ensemble experiments
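The sketch below illustrates the core idea under stated assumptions: the same video is sampled with different temporal strides, so that each base learner is trained on clips of a different temporal resolution. The clip length and the stride values are illustrative, not the team's exact settings.

```python
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int, stride: int) -> np.ndarray:
    """Pick `clip_len` frame indices spaced `stride` apart (clamped to the video length)."""
    start = np.random.randint(0, max(1, num_frames - clip_len * stride))
    idx = start + np.arange(clip_len) * stride
    return np.clip(idx, 0, num_frames - 1)

# One training set (and one base learner) per temporal resolution, e.g. strides 1, 2, 4.
strides = [1, 2, 4]                    # assumed values for illustration
num_frames, clip_len = 300, 16         # a 300-frame video, 16-frame clips
clips = {s: sample_frame_indices(num_frames, clip_len, s) for s in strides}
```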
2.2 Backbones
We tested five networks, covering both frame-based and spatiotemporal architectures: Temporal Segment Network [10,11], TimeSformer [2], BEiT [1], Swin Transformer [7] and Video Swin Transformer [8]. In our experiments, Video Swin Transformer achieved the best results. We also compared the computational complexity of the different network structures.
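To make the distinction between the two families concrete, here is a minimal sketch: a frame-based model applies a 2D backbone to each sampled frame and averages the per-frame logits (TSN-style), while a spatiotemporal model consumes the whole clip as a single 5-D tensor. The backbone modules are placeholders, not specific library implementations.

```python
import torch
import torch.nn as nn

class FrameBasedClassifier(nn.Module):
    """Frame-based: run a 2D backbone per frame, then average per-frame logits (TSN-style)."""
    def __init__(self, backbone_2d: nn.Module):
        super().__init__()
        self.backbone_2d = backbone_2d   # any image model mapping (N, C, H, W) -> (N, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape       # (batch, frames, channels, height, width)
        logits = self.backbone_2d(clip.reshape(b * t, c, h, w))
        return logits.reshape(b, t, -1).mean(dim=1)

class SpatiotemporalClassifier(nn.Module):
    """Spatiotemporal: the 3D backbone sees the whole clip at once (Video Swin-style layout)."""
    def __init__(self, backbone_3d: nn.Module):
        super().__init__()
        self.backbone_3d = backbone_3d   # any video model mapping (N, C, T, H, W) -> (N, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.backbone_3d(clip.permute(0, 2, 1, 3, 4))  # (B, T, C, H, W) -> (B, C, T, H, W)
```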

In the final submission, we ensembled models with seven different network structures, different pre-training datasets and different sampling rates, reaching the best test-set top-1 accuracy of 62.39 and winning first place in the video classification track of this competition.
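A minimal sketch of such a prediction-level ensemble is shown below, averaging the softmax probabilities of the individual models; the uniform weighting is an assumption, as the article does not specify the exact fusion rule.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, clip_per_model):
    """Average softmax probabilities over all base models and return the top-1 class."""
    probs = []
    for model, clip in zip(models, clip_per_model):   # each model gets a clip at its own temporal resolution
        model.eval()
        probs.append(torch.softmax(model(clip), dim=-1))  # each: (batch, num_classes)
    avg_probs = torch.stack(probs).mean(dim=0)            # uniform weights (assumed)
    return avg_probs.argmax(dim=-1)                       # predicted class per video
```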

Summary
In the ACM Multimedia 2022 video understanding challenge, we adopted an ensemble strategy over multiple temporal sampling rates and further combined a variety of network structures and pre-training datasets. This ultimately won first place in the video classification track and offers a new approach to video understanding pre-training.