ACM MM 2022 Video Understanding Challenge: technology sharing from the AutoX team, champion of the video classification track
2022-07-01 18:51:00 【Zhiyuan community】
ACM Multimedia (ACM MM), first held in 1993, is the top venue for academic and industrial exchange in the international multimedia field. It is also the only international academic conference in multimedia that the China Computer Federation (CCF) recommends at the A level. The Pre-training for Video Understanding Challenge is one of the major events it hosts.
In this competition, the AutoX team from 4Paradigm used a new temporal multi-scale pre-training scheme for video classification and won first place in the video classification track by a clear margin.

Introduction to the challenge task
In recent years, with the rise of short video, hundreds of millions of multimedia videos have appeared on the Internet. These videos often carry only weak labels such as topics and categories, with high label noise and a large category span. Although recent advances in computer vision have achieved great success in video classification, video captioning, video object detection and other areas, how to make effective use of the large number of unlabeled or weakly labeled videos on the Internet remains a question worth studying. The Pre-training for Video Understanding Challenge aims to promote research on video pre-training and encourages research teams to design new pre-training techniques that improve a range of downstream tasks.
In this article we focus on the video classification track. The competition provides YOVO-3M, a pre-training dataset of 3 million videos crawled from YouTube. Each video comes with its YouTube title and a query that serves as the video category (such as bowling, archery, tigher cat). The competition also provides YOVO-downstream, a downstream task dataset of about 100,000 videos, made up of a training set of 70,173 videos, a validation set of 16,439 videos and a test set of 16,554 videos. These videos are divided into 240 predefined categories, covering objects (such as Aircraft, Pizza, Football) and human actions (such as Waggle, High jump, Riding).
In this track, given the YouTube videos and the corresponding queries and titles in YOVO-3M, the goal of the contestants is to learn a general video representation through pre-training that can then be used to improve the downstream video classification task. Contestants are required to build a video classification system from the provided YOVO-3M dataset (as pre-training data) and the released YOVO downstream dataset (as training data for the downstream task). The evaluation metric is the top-1 accuracy of the classification system on the downstream task dataset. There is no restriction on the use of external datasets.
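To make the evaluation metric concrete, here is a minimal sketch of top-1 accuracy. The function name `top1_accuracy` and the toy scores are illustrative assumptions, not the organizers' evaluation code.

```python
import numpy as np


def top1_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of videos whose highest-scoring class equals the true label.

    scores: array of shape (num_videos, num_classes)
    labels: array of shape (num_videos,) with integer class ids
    """
    predictions = scores.argmax(axis=1)
    return float((predictions == labels).mean())


# Toy data: 4 videos and 3 classes (the challenge itself uses 240 classes).
scores = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])
labels = np.array([1, 0, 1, 1])

accuracy = top1_accuracy(scores, labels)
print(accuracy)  # 3 of the 4 toy videos are classified correctly
```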

query: brushing
title: Disney Jr Puppy Dog Pals Morning Routine Brushing Teeth, Taking a Bath, and Eating Breakfast!
Solution
We developed an "ensemble on multiple temporal resolutions" technique, which improves both model pre-training and downstream task performance through ensemble learning, and we combined seven different network structures to learn different video representations. In the following sections we introduce the multi-temporal-resolution ensemble technique proposed by the team and briefly describe the network structures we used in the competition.
2.1 Ensemble on Multiple Temporal Resolutions
Ensemble learning can significantly improve model performance on a wide range of tasks. At the core of variance-reduction methods is the requirement that different base learners learn different knowledge from the data, so that the final generalization performance improves through the consensus of those learners. Bagging [13] is one representative algorithm. We start from the idea of Bagging, but instead of building training subsets by random sampling as in the original algorithm, we sample the videos at different temporal sampling rates to obtain training sets with different temporal resolutions, and train a different base learner on each. Experiments show that our method significantly improves the ensemble. Moreover, because every base learner can use all the training videos and therefore reaches higher single-model performance, our method also outperforms the traditional Bagging ensemble strategy.
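The sampling idea can be sketched as follows. This is only an illustration under stated assumptions: the article does not specify how the sampling rates were realized, so the frame stride used here, the clip length of 8, the strides (2, 4, 8), and the helper names `sample_frame_indices` and `build_multi_resolution_views` are all hypothetical.

```python
import numpy as np


def sample_frame_indices(num_frames: int, clip_len: int, stride: int) -> np.ndarray:
    """Pick clip_len frame indices spaced `stride` frames apart.

    The clip is centred in the video. Indices that would fall past the end
    are clamped to the last frame, so short videos repeat their final frame.
    """
    span = (clip_len - 1) * stride + 1            # frames covered by the clip
    start = max((num_frames - span) // 2, 0)      # centre the clip if it fits
    indices = start + stride * np.arange(clip_len)
    return np.minimum(indices, num_frames - 1)


def build_multi_resolution_views(num_frames: int, clip_len: int = 8,
                                 strides=(2, 4, 8)) -> dict:
    """Return one set of frame indices per temporal stride.

    Each base learner would be trained on one stride. Unlike Bagging,
    no video is dropped: every learner sees every training video.
    """
    return {s: sample_frame_indices(num_frames, clip_len, s) for s in strides}


views = build_multi_resolution_views(num_frames=64)
for stride, indices in views.items():
    print(stride, indices.tolist())
```

A small stride gives a learner a short, finely sampled window, while a large stride covers most of the video coarsely, which is one plausible way for the base learners to end up with different views of the same data.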

▲ Fusion with multiple temporal resolutions

▲ Ensemble experiments
2.2 Backbones
We tested five frame-based and spatiotemporal networks: Temporal Segment Network [10,11], TimeSformer [2], BEiT [1], Swin Transformer [7] and Video Swin Transformer [8]. In our experiments, Video Swin Transformer achieved the best results. We also compared the computational complexity of the different network structures.

In the final submission, we ensembled models with seven different network structures, different pre-training datasets and different sampling rates. This gave our best test-set top-1 accuracy of 62.39 and first place in the video classification track of the challenge.
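The article does not say how the individual models' predictions were fused. One common choice, shown here purely as an assumption, is to average the softmax probabilities of all models, optionally with per-model weights. The function `ensemble_predict` and the toy logits are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def ensemble_predict(logits_per_model, weights=None):
    """Fuse several models by a weighted average of their class probabilities.

    logits_per_model: list of arrays, each of shape (num_videos, num_classes)
    weights: optional per-model weights; equal weights if omitted
    Returns (predicted class ids, fused probabilities).
    """
    probs = np.stack([softmax(np.asarray(l, dtype=float)) for l in logits_per_model])
    if weights is None:
        weights = np.ones(len(logits_per_model))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalise so weights sum to 1
    fused = np.tensordot(weights, probs, axes=1)   # (num_videos, num_classes)
    return fused.argmax(axis=1), fused


# Toy logits from three hypothetical models for 2 videos and 3 classes.
model_a = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]])
model_b = np.array([[0.0, 3.0, 0.0], [0.0, 0.5, 2.5]])
model_c = np.array([[2.5, 0.5, 0.0], [1.0, 0.0, 2.0]])

predicted, fused = ensemble_predict([model_a, model_b, model_c])
print(predicted.tolist())
```

In the toy data, model_b disagrees with the other two on the first video, and the averaged probabilities follow the majority. Weights, if used, would typically be tuned on the validation set.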

Summary
In the ACM Multimedia 2022 video understanding challenge, we used an ensemble strategy over multiple temporal sampling rates, combined with a variety of network structures and pre-training datasets, and won first place in the video classification track. The approach offers a new direction for video understanding and pre-training.