[CVPR‘22 Oral2] TAN: Temporal Alignment Networks for Long-term Video
2022-07-02 07:42:00 【Xiao Chen who wants money】
Title: Temporal Alignment Networks for Long-term Video
Authors: Tengda Han, Weidi Xie, and Andrew Zisserman
Affiliations: Visual Geometry Group, University of Oxford, and Shanghai Jiao Tong University
Keywords: clip, video
Paper: https://arxiv.org/pdf/2204.02968.pdf
First of all, video is not my research direction, so if there are any mistakes, corrections are welcome.
Abstract
The goal of this paper is to build a temporal alignment network that ingests long video sequences and their associated text sentences, in order to: (1) determine whether a sentence is alignable with the video; and (2) if it is alignable, determine its alignment. The challenge is to train such a network from large-scale datasets (such as HowTo100M), where the associated text sentences carry significant noise and, even when relevant, are only weakly aligned.
Beyond the proposed alignment network, the paper makes four contributions: (i) a novel co-training method that can train on raw instructional videos despite heavy noise, without using manual annotation for denoising; (ii) to benchmark alignment performance, a manually curated 10-hour subset of HowTo100M, totaling 80 videos with sparse temporal descriptions; after training on HowTo100M, the model shows a large advantage over strong baselines (CLIP, MIL-NCE) on this alignment dataset; (iii) applying the trained model zero-shot to multiple downstream video understanding tasks achieves state-of-the-art results, including text-to-video retrieval on YouCook2 and weakly supervised video action segmentation on Breakfast-Action; (iv) fine-tuning the backbone model end-to-end with the automatically aligned HowTo100M annotations yields better performance on downstream action recognition tasks.
Preliminary knowledge
Video alignment
As shown in the figure below, the goal is simply to make sentences correspond to frames: blue marks alignable text, and orange marks text that cannot be aligned (for example, because a sentence may describe the taste of an object, the time of day, and so on).
Task description
Given an untrimmed video X = {I, S}, where I = {I1, I2, ..., IT} are the T frames and S = {S1, ..., SK} are the K sentences (sorted by time). For the k-th sentence, we have the corresponding timestamp [t_k^start, t_k^end]. The goal is to obtain {y_hat, A_hat} through a nonlinear function.
Here, y_hat is a binary indicator over all sentences, so its dimension is K×2; it indicates whether each sentence is alignable text. A_hat is the frame-sentence alignment matrix.
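As a minimal illustration of the task's inputs and outputs (all sizes and features below are hypothetical placeholders, not the paper's actual model), the shapes can be sketched in numpy:

```python
import numpy as np

# Hypothetical sizes, for illustration only: T frames, K sentences, feature dim D.
T, K, D = 6, 3, 4

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(T, D))   # stand-in for per-frame visual features
sent_feats = rng.normal(size=(K, D))    # stand-in for per-sentence text features

# A_hat: frame-sentence alignment scores (raw dot products as a placeholder).
A_hat = frame_feats @ sent_feats.T      # shape (T, K)

# y_hat: per-sentence alignability logits; a placeholder linear head with 2 classes.
W = rng.normal(size=(D, 2))
y_hat = sent_feats @ W                  # shape (K, 2)

print(A_hat.shape, y_hat.shape)         # (6, 3) (3, 2)
```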
TAN
The structure of TAN is shown on the left of the figure above. Frames pass through an S3D-G backbone to extract features, giving the visual tokens; text passes through a word2vec embedding plus two linear layers to obtain the text tokens. Both then go through a multimodal transformer to obtain representations with cross-modal interaction. The two token sets are used to compute an alignment matrix via cosine similarity. Meanwhile, a single linear layer outputs y_hat. The formulas are summarized as follows:
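The cosine-similarity step can be sketched as follows (the token features here are arbitrary; only the normalize-then-dot-product computation mirrors the description above):

```python
import numpy as np

def cosine_alignment(vis_tokens, txt_tokens, eps=1e-8):
    """Row-normalize both token sets, then take pairwise dot products.

    Returns a (T, K) matrix with entries in [-1, 1].
    """
    v = vis_tokens / (np.linalg.norm(vis_tokens, axis=1, keepdims=True) + eps)
    t = txt_tokens / (np.linalg.norm(txt_tokens, axis=1, keepdims=True) + eps)
    return v @ t.T

# Parallel vectors give similarity ~1; orthogonal vectors give ~0.
vis = np.array([[1.0, 0.0], [0.0, 2.0]])
txt = np.array([[2.0, 0.0]])
A = cosine_alignment(vis, txt)
print(np.round(A, 3))   # ≈ [[1.], [0.]]
```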
Training
Training uses contrastive learning with the InfoNCE loss; the formula is shown in the figure. (This part is somewhat unclear to me.)
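The InfoNCE objective itself is standard; a minimal numpy sketch of the per-query loss (the temperature and similarity values below are arbitrary, not the paper's exact settings):

```python
import numpy as np

def info_nce(similarities, pos_idx, temperature=0.07):
    """InfoNCE for one query: -log softmax(sim / tau)[positive].

    `similarities` holds the query's similarity to one positive and
    several negatives; `pos_idx` marks the positive.
    """
    logits = np.asarray(similarities, dtype=float) / temperature
    logits = logits - logits.max()            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[pos_idx]

# With N equal similarities the loss is log(N): the model is at chance.
print(info_nce([0.5, 0.5, 0.5], pos_idx=0))   # ≈ log(3) ≈ 1.0986
```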
Co-training
Co-training is the core. The author first introduces a dual encoder, shown on the right of Figure 2: the dual encoder has no cross-modal information interaction, and information from the two modalities only meets when the final similarity matrix is computed. The author argues that this makes the model more sensitive.
Figures 3(a) and 3(b) show the similarity matrices of TAN and the dual encoder. The two outputs are combined: compute the IoU between the TAN output and the dual-encoder output; if it exceeds a threshold, take the agreement of the two as the pseudo-labels; if the threshold is not exceeded, keep the previous labels.
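A sketch of this agreement test (the IoU threshold value, the binarization of the two similarity outputs, and taking their intersection as the new pseudo-label are all simplifying assumptions for illustration; see the paper for the exact rule):

```python
import numpy as np

def iou_binary(a, b):
    """Intersection-over-union of two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def update_pseudo_labels(tan_mask, dual_mask, prev_labels, thresh=0.5):
    """If TAN and the dual encoder agree (IoU above the threshold),
    take their agreement as the new pseudo-labels; otherwise keep the
    previous labels."""
    if iou_binary(tan_mask, dual_mask) > thresh:
        return np.logical_and(tan_mask, dual_mask)
    return np.asarray(prev_labels, bool)

tan  = [1, 1, 0, 0]
dual = [1, 1, 1, 0]      # IoU = 2/3, above the default threshold
print(update_pseudo_labels(tan, dual, prev_labels=[0, 0, 0, 0]))
# [ True  True False False]
```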