[CVPR'22 Oral] TAN: Temporal Alignment Networks for Long-term Video
2022-07-02 07:42:00 【Xiao Chen who wants money】
Title: Temporal Alignment Networks for Long-term Video
Authors: Tengda Han, Weidi Xie, and Andrew Zisserman
Affiliations: Visual Geometry Group, University of Oxford; Shanghai Jiao Tong University
Keywords: CLIP, video
Paper: https://arxiv.org/pdf/2204.02968.pdf
A disclaimer first: video is not my research direction, so if there are any mistakes, corrections are welcome.
Abstract
The goal of this paper is to build a temporal alignment network that ingests long-term video sequences and associated text sentences, in order to: (1) determine whether a sentence is alignable with the video; and (2) if it is alignable, determine its alignment. The challenge is to train such a network from large-scale datasets (such as HowTo100M), where the associated text sentences carry significant noise and are only weakly aligned even when relevant.
In addition to proposing the alignment network, we make four contributions: (i) we describe a novel co-training method that, despite the considerable noise, can denoise and train on raw instructional videos without using manual annotation; (ii) to benchmark alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions. Our model, trained on HowTo100M, outperforms the baselines (CLIP, MIL-NCE) on this alignment dataset by a large margin; (iii) we apply the trained model zero-shot to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2 and weakly supervised video action segmentation on Breakfast-Action; (iv) we use the automatically aligned HowTo100M annotations to fine-tune the backbone model end-to-end and obtain better performance on downstream action recognition tasks.
Preliminary knowledge
Video alignment
As shown in the figure below, the aim is simply for sentences and frames to correspond: blue marks text that is alignable with the video, while orange marks text that is not (for example, a sentence describing the taste of an object, the time, and so on, which has no visual counterpart).
Task description
Given an untrimmed video X = {I, S}, where I = {I_1, I_2, ..., I_T} denotes the T frames and S = {S_1, ..., S_K} denotes the K sentences (sorted by time), each sentence k comes with a corresponding timestamp [t_k^start, t_k^end]. Our goal is to learn a nonlinear function that maps X to {y_hat, A_hat}.
Here, y_hat holds a binary alignability label for every sentence (two logits per sentence, so its dimension is K×2); it indicates whether each sentence is alignable text. A_hat is the alignment matrix between frames and sentences.
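Putting these definitions into one expression (my own reconstruction from the definitions above; the paper's exact notation may differ):

```latex
% Task formulation (a reconstruction; notation may differ from the paper)
\{\hat{y}, \hat{A}\} = \Phi(X), \qquad X = \{I, S\},
\qquad \hat{y} \in \mathbb{R}^{K \times 2}, \qquad \hat{A} \in \mathbb{R}^{K \times T}
```

where y_hat_k indicates whether sentence S_k is alignable, and A_hat_{k,t} scores how well sentence S_k aligns with frame I_t.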
TAN
The structure of TAN is shown on the left of the figure above. Frames are passed through an S3D-G backbone to extract features, giving the visual tokens; text is passed through word2vec embeddings plus two linear layers, giving the text tokens. Both then go through a multimodal transformer, yielding features enriched with cross-modal interaction. These two token sets are compared with cosine similarity to compute the alignment matrix; meanwhile, a single linear layer outputs y_hat. The forward pass is sketched below.
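To make the data flow concrete, here is a minimal PyTorch sketch of such a forward pass. This is my own reconstruction, not the authors' code: the feature dimensions (vis_dim=1024 for S3D-G, txt_dim=300 for word2vec), the transformer size, and all module names are assumptions, and both backbones are replaced by precomputed features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TANSketch(nn.Module):
    """Minimal sketch of a TAN-style forward pass (sizes are assumptions)."""

    def __init__(self, vis_dim=1024, txt_dim=300, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        # Project S3D-G visual features into the shared model dimension.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        # Text path: word2vec embeddings followed by two linear layers.
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Joint multimodal transformer over concatenated video/text tokens.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Linear head predicting per-sentence alignability (2 logits each).
        self.align_head = nn.Linear(d_model, 2)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, T, vis_dim) precomputed S3D-G clip features
        # txt_feats: (B, K, txt_dim) precomputed word2vec sentence features
        v = self.vis_proj(vis_feats)            # (B, T, d)
        s = self.txt_proj(txt_feats)            # (B, K, d)
        T, K = v.size(1), s.size(1)
        # Let the two modalities exchange information in one transformer.
        tokens = self.transformer(torch.cat([v, s], dim=1))
        v, s = tokens[:, :T], tokens[:, T:]
        # Alignment matrix: cosine similarity of every sentence-frame pair.
        A_hat = torch.einsum('btd,bkd->bkt',
                             F.normalize(v, dim=-1), F.normalize(s, dim=-1))
        # Alignability: one linear layer on the contextualised text tokens.
        y_hat = self.align_head(s)              # (B, K, 2)
        return y_hat, A_hat

# Usage with dummy features: 1 video, 64 clips, 8 sentences.
model = TANSketch()
y_hat, A_hat = model(torch.randn(1, 64, 1024), torch.randn(1, 8, 300))
print(y_hat.shape, A_hat.shape)  # torch.Size([1, 8, 2]) torch.Size([1, 8, 64])
```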
Training
Training uses contrastive learning with an InfoNCE-style loss; the formula is shown in the paper's figure. (This part is a little unclear to me.)
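For reference, the generic InfoNCE loss that such contrastive training builds on takes the form below. This is the standard formulation, not copied from the paper, whose variant aggregates over multiple pseudo-positive frame-sentence pairs.

```latex
% Standard InfoNCE for one positive frame-sentence pair (v_t, s_k);
% tau is a temperature, sim(.,.) the cosine similarity used in A_hat.
\mathcal{L}_{\mathrm{NCE}} = -\log
  \frac{\exp\!\big(\mathrm{sim}(v_t, s_k)/\tau\big)}
       {\sum_{j=1}^{K} \exp\!\big(\mathrm{sim}(v_t, s_j)/\tau\big)}
```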
Co-training
Co-training is the core idea. The authors first introduce a dual encoder, shown on the right of Figure 2. Unlike TAN, the dual encoder has no cross-modal information exchange; the two streams only interact when the final similarity matrix is computed. The authors argue that this makes the model more sensitive.
As shown in Figures 3(a) and 3(b), which visualise the similarity matrices of TAN and the dual encoder, the two outputs are combined as follows: compute the IoU between TAN's output and the dual encoder's output; if it exceeds a threshold, fuse the two outputs into a new pseudo-label; if the threshold is not exceeded, keep the previous label. A sketch of this update is given below.
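In code, the agreement test boils down to an IoU between the two models' binarised alignment predictions. The following is a minimal sketch under stated assumptions: the binarisation rule (thresholding at the row mean), the fusion rule (intersection of the two masks), the IoU threshold of 0.5, and the function names are all illustrative, not the paper's implementation.

```python
import torch

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> torch.Tensor:
    """IoU between two binary temporal masks of shape (K, T), per sentence."""
    inter = (mask_a & mask_b).sum(dim=-1).float()
    union = (mask_a | mask_b).sum(dim=-1).float()
    return inter / union.clamp(min=1)

def update_pseudo_labels(tan_scores, dual_scores, old_labels, iou_thresh=0.5):
    """Accept a new pseudo-label only where TAN and the dual encoder agree.

    tan_scores, dual_scores: (K, T) frame-sentence similarity matrices.
    old_labels: (K, T) binary masks kept from the previous round.
    """
    # Binarise each model's alignment (illustrative: threshold at row mean).
    tan_mask = tan_scores > tan_scores.mean(dim=-1, keepdim=True)
    dual_mask = dual_scores > dual_scores.mean(dim=-1, keepdim=True)
    agree = mask_iou(tan_mask, dual_mask) > iou_thresh   # (K,) per sentence
    # Agreement: fuse the two predictions into a new pseudo-label
    # (shown here as their intersection; the exact fusion rule is my assumption).
    fused = tan_mask & dual_mask
    # Disagreement: keep the previous round's label unchanged.
    return torch.where(agree.unsqueeze(-1), fused, old_labels.bool())

# Toy example: 3 sentences over 10 frames.
K, T = 3, 10
tan, dual = torch.rand(K, T), torch.rand(K, T)
old = torch.zeros(K, T, dtype=torch.bool)
print(update_pseudo_labels(tan, dual, old).shape)  # torch.Size([3, 10])
```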