[CVPR'22 Oral] TAN: Temporal Alignment Networks for Long-term Video
2022-07-02 07:42:00 【Xiao Chen who wants money】
Title: Temporal Alignment Networks for Long-term Video
Authors: Tengda Han, Weidi Xie, and Andrew Zisserman
Affiliations: Visual Geometry Group, University of Oxford; Shanghai Jiao Tong University
Keywords: CLIP, video
Paper: https://arxiv.org/pdf/2204.02968.pdf
A disclaimer first: video is not my research direction, so if there are any mistakes, corrections are welcome.
Abstract
The goal of this paper is to build a temporal alignment network that ingests long-term video sequences and associated text sentences, in order to: (1) determine whether a sentence is alignable with the video; and (2) if it is alignable, determine its alignment. The challenge is to train such a network from large-scale datasets (such as HowTo100M), where the associated text sentences carry significant noise and are only weakly aligned even when relevant.
In addition to proposing the alignment network, we make four contributions: (i) we describe a novel co-training method that, despite the considerable noise, can denoise and train on raw instructional videos without using manual annotation; (ii) to benchmark alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions. Our model, trained on HowTo100M, outperforms the baselines (CLIP, MIL-NCE) on this alignment dataset by a large margin; (iii) we apply the trained model zero-shot to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2 and weakly supervised video action segmentation on Breakfast-Action; (iv) we use the automatically aligned HowTo100M annotations to fine-tune the backbone model end-to-end and obtain better performance on downstream action recognition tasks.
Preliminary knowledge
Video alignment
As shown in the figure below, the aim is simply for sentences and frames to correspond: blue marks text that is alignable with the video, while orange marks text that is not (for example, a sentence describing the taste of an object, the time, and so on, which has no visual counterpart).
Task description
Given an untrimmed video X = {I, S}, where I = {I_1, I_2, ..., I_T} denotes the T frames and S = {S_1, ..., S_K} denotes the K sentences (sorted by time), each sentence k comes with a corresponding timestamp [t_k^start, t_k^end]. Our goal is to learn a nonlinear function that maps X to {y_hat, A_hat}.
Here, y_hat holds a binary alignability label for every sentence (two logits per sentence, so its dimension is K×2); it indicates whether each sentence is alignable text. A_hat is the alignment matrix between frames and sentences.
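Putting these definitions into one expression (my own reconstruction from the definitions above; the paper's exact notation may differ):

```latex
% Task formulation (a reconstruction; notation may differ from the paper)
\{\hat{y}, \hat{A}\} = \Phi(X), \qquad X = \{I, S\},
\qquad \hat{y} \in \mathbb{R}^{K \times 2}, \qquad \hat{A} \in \mathbb{R}^{K \times T}
```

where y_hat_k indicates whether sentence S_k is alignable, and A_hat_{k,t} scores how well sentence S_k aligns with frame I_t.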
TAN
The structure of TAN is shown on the left of the figure above. Frames are passed through an S3D-G backbone to extract features, giving the visual tokens; text is passed through word2vec embeddings plus two linear layers, giving the text tokens. Both then go through a multimodal transformer, yielding features enriched with cross-modal interaction. These two token sets are compared with cosine similarity to compute the alignment matrix; meanwhile, a single linear layer outputs y_hat. The forward pass is sketched below.
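To make the data flow concrete, here is a minimal PyTorch sketch of such a forward pass. This is my own reconstruction, not the authors' code: the feature dimensions (vis_dim=1024 for S3D-G, txt_dim=300 for word2vec), the transformer size, and all module names are assumptions, and both backbones are replaced by precomputed features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TANSketch(nn.Module):
    """Minimal sketch of a TAN-style forward pass (sizes are assumptions)."""

    def __init__(self, vis_dim=1024, txt_dim=300, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        # Project S3D-G visual features into the shared model dimension.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        # Text path: word2vec embeddings followed by two linear layers.
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Joint multimodal transformer over concatenated video/text tokens.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Linear head predicting per-sentence alignability (2 logits each).
        self.align_head = nn.Linear(d_model, 2)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, T, vis_dim) precomputed S3D-G clip features
        # txt_feats: (B, K, txt_dim) precomputed word2vec sentence features
        v = self.vis_proj(vis_feats)            # (B, T, d)
        s = self.txt_proj(txt_feats)            # (B, K, d)
        T, K = v.size(1), s.size(1)
        # Let the two modalities exchange information in one transformer.
        tokens = self.transformer(torch.cat([v, s], dim=1))
        v, s = tokens[:, :T], tokens[:, T:]
        # Alignment matrix: cosine similarity of every sentence-frame pair.
        A_hat = torch.einsum('btd,bkd->bkt',
                             F.normalize(v, dim=-1), F.normalize(s, dim=-1))
        # Alignability: one linear layer on the contextualised text tokens.
        y_hat = self.align_head(s)              # (B, K, 2)
        return y_hat, A_hat

# Usage with dummy features: 1 video, 64 clips, 8 sentences.
model = TANSketch()
y_hat, A_hat = model(torch.randn(1, 64, 1024), torch.randn(1, 8, 300))
print(y_hat.shape, A_hat.shape)  # torch.Size([1, 8, 2]) torch.Size([1, 8, 64])
```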
Training
Training uses contrastive learning with an InfoNCE-style loss; the formula is shown in the paper's figure. (This part is a little unclear to me.)
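For reference, the generic InfoNCE loss that such contrastive training builds on takes the form below. This is the standard formulation, not copied from the paper, whose variant aggregates over multiple pseudo-positive frame-sentence pairs.

```latex
% Standard InfoNCE for one positive frame-sentence pair (v_t, s_k);
% tau is a temperature, sim(.,.) the cosine similarity used in A_hat.
\mathcal{L}_{\mathrm{NCE}} = -\log
  \frac{\exp\!\big(\mathrm{sim}(v_t, s_k)/\tau\big)}
       {\sum_{j=1}^{K} \exp\!\big(\mathrm{sim}(v_t, s_j)/\tau\big)}
```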
Co-training
Co-training is the core idea. The authors first introduce a dual encoder, shown on the right of Figure 2. Unlike TAN, the dual encoder has no cross-modal information exchange; the two streams only interact when the final similarity matrix is computed. The authors argue that this makes the model more sensitive.
As shown in Figures 3(a) and 3(b), which visualise the similarity matrices of TAN and the dual encoder, the two outputs are combined as follows: compute the IoU between TAN's output and the dual encoder's output; if it exceeds a threshold, fuse the two outputs into a new pseudo-label; if the threshold is not exceeded, keep the previous label. A sketch of this update is given below.
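In code, the agreement test boils down to an IoU between the two models' binarised alignment predictions. The following is a minimal sketch under stated assumptions: the binarisation rule (thresholding at the row mean), the fusion rule (intersection of the two masks), the IoU threshold of 0.5, and the function names are all illustrative, not the paper's implementation.

```python
import torch

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> torch.Tensor:
    """IoU between two binary temporal masks of shape (K, T), per sentence."""
    inter = (mask_a & mask_b).sum(dim=-1).float()
    union = (mask_a | mask_b).sum(dim=-1).float()
    return inter / union.clamp(min=1)

def update_pseudo_labels(tan_scores, dual_scores, old_labels, iou_thresh=0.5):
    """Accept a new pseudo-label only where TAN and the dual encoder agree.

    tan_scores, dual_scores: (K, T) frame-sentence similarity matrices.
    old_labels: (K, T) binary masks kept from the previous round.
    """
    # Binarise each model's alignment (illustrative: threshold at row mean).
    tan_mask = tan_scores > tan_scores.mean(dim=-1, keepdim=True)
    dual_mask = dual_scores > dual_scores.mean(dim=-1, keepdim=True)
    agree = mask_iou(tan_mask, dual_mask) > iou_thresh   # (K,) per sentence
    # Agreement: fuse the two predictions into a new pseudo-label
    # (shown here as their intersection; the exact fusion rule is my assumption).
    fused = tan_mask & dual_mask
    # Disagreement: keep the previous round's label unchanged.
    return torch.where(agree.unsqueeze(-1), fused, old_labels.bool())

# Toy example: 3 sentences over 10 frames.
K, T = 3, 10
tan, dual = torch.rand(K, T), torch.rand(K, T)
old = torch.zeros(K, T, dtype=torch.bool)
print(update_pseudo_labels(tan, dual, old).shape)  # torch.Size([3, 10])
```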