ActBERT: Baidu and the University of Technology Sydney propose ActBERT to learn global-local video-text representations, effective on five video-text tasks
2022-07-03 20:39:00 【I love computer vision】
This article covers the paper "ActBERT: Learning Global-Local Video-Text Representations", in which Baidu and the University of Technology Sydney propose ActBERT, a model that learns global and local video-text representations and is effective on five video-text tasks.
The details are as follows:
Paper link: https://arxiv.org/abs/2011.07231
In this paper, the authors propose ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, they use global action information to catalyze the interactions between linguistic text and local regional objects. The model uncovers global and local visual cues from paired video sequences and text descriptions for detailed modeling of visual-text relations.
Second, the authors introduce a TaNgled Transformer block (TNT) to encode three sources of information: global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered through cues extracted from contextual information. This forces the joint video-text representation to be aware of both fine-grained objects and global human intention. The authors validate ActBERT's generalization on downstream video-and-language tasks: text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT clearly outperforms state-of-the-art methods, demonstrating its strength in video-text representation learning.
Although supervised learning has been successful across computer vision tasks, self-supervised representation learning from unlabeled data has attracted growing attention in recent years. In self-supervised learning, a model is first pre-trained on a large amount of unlabeled data with a proxy loss. Fine-tuning then adapts the pre-trained model to specific downstream tasks. Recently, self-supervised representation learning for text has advanced rapidly: the Bidirectional Encoder Representations from Transformers (BERT) model generalizes remarkably well to many natural language tasks.
Inspired by BERT's success in self-supervised training, this paper aims to learn an analogous model for joint video and text modeling. The authors exploit video-text relations in narrated instructional videos, where the aligned text is produced by an off-the-shelf automatic speech recognition (ASR) model. Such instructional videos are a natural source for studying video-text relations. First, they are available on YouTube and other platforms. Second, the visual frames are aligned with the instructional narration. The narration not only explicitly covers the objects in the scene but also identifies the salient actions in the video clip.
To extend BERT to video-and-language tasks, earlier researchers adapted the BERT model by learning from quantized video frame features. The original BERT takes discrete elements as input and predicts the corresponding tokens as output. Visual features, in contrast, are distributed representations with real values, and real-valued features cannot be directly categorized into discrete labels for "visual token" prediction. Those researchers therefore discretized visual features into visual words via clustering, so that the visual tokens could be passed directly to the original BERT model. However, clustering may discard detailed local information such as interacting objects and human actions, which prevents the model from uncovering fine-grained relations between video and text. In this paper, the authors propose ActBERT to learn a joint video-text representation that uncovers global and local visual cues from paired video sequences and text descriptions. Both the global and the local visual signals interact with the semantic stream. ActBERT leverages deep contextual information and exploits fine-grained relations for video-text joint modeling.
First, ActBERT incorporates global actions, local regional objects, and text descriptions into a joint framework. Actions are crucial for various video-related downstream tasks. Recognizing human actions demonstrates a model's capacity for motion understanding and for reasoning about complex human intention. Explicitly modeling human actions during pre-training can therefore be beneficial. Despite their importance, action cues were largely ignored in previous self-supervised video-text training, where actions were treated in the same way as objects. To model human actions, the authors first extract verbs from the text descriptions and construct an action classification dataset from the original dataset. They then train a 3D convolutional network to predict the action labels. The features of the optimized network serve as action embeddings. In this way, clip-level actions are represented, and the corresponding action labels are inserted. Besides global action information, the authors also incorporate local regional information to provide fine-grained visual cues. Object regions provide detailed visual cues about the whole scene, including the regional object features and the positions of the objects. The language model can benefit from the regional information for better language-visual alignment.
Second, the authors introduce a TaNgled Transformer block (TNT) to encode features from three sources: global actions, local regional objects, and language tokens. Previous studies considered two modalities when designing new Transformer layers, namely fine-grained object information from images and natural language. In this setting, however, there are three input sources. Two of them, local regional features and linguistic text, offer detailed descriptions of the events occurring in the clip. The third, global action features, provides human intention over the time series as well as a direct cue for contextual inference. The authors design a new tangled Transformer block for cross-modal feature learning from the three sources. To enhance the interactions between the two visual cues and the linguistic features, each modality is encoded by a separate Transformer block. Mutual cross-modal communication is then strengthened by two additional multi-head attention blocks. The action features catalyze these mutual interactions. Guided by the action features, the authors inject visual information into the language Transformer and incorporate linguistic information into the visual Transformer. The tangled Transformer dynamically selects judicious cues from its context to facilitate target prediction.
In addition, the authors design four proxy tasks to train ActBERT: masked language modeling with global and local visual cues, masked action classification, masked object classification, and cross-modal matching. The pre-trained ActBERT is transferred to five video-related downstream tasks: video captioning, action segmentation, text-video clip retrieval, action step localization, and video question answering. Quantitative results show that ActBERT achieves state-of-the-art performance by a clear margin.
3.1. Preliminary
This section first reviews the original BERT model. BERT pre-trains a language model on large corpora in an unsupervised way. The pre-trained model has been found to be flexible and beneficial to a variety of downstream tasks, such as question answering.
In BERT, the input entities are processed by a multi-layer bidirectional Transformer. Each input embedding passes through stacked self-attention layers that aggregate contextual features. The attention weights are generated adaptively. The output features contain contextual information about the original input sequence. In self-attention, the generated features do not depend on the order of the input sequence, so the output representation is permutation-invariant: shuffling the input sequence does not change the representation computed for each element. A position embedding is therefore commonly applied to each input entity to incorporate sequential order cues.
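The order-independence of self-attention, and the effect of adding position embeddings, can be checked numerically. The following is a minimal sketch, assuming a single attention head with identity projections and random toy vectors; the helper names (`self_attention`, `add_positions`) are illustrative and not taken from the paper.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (assumption):
    queries, keys and values are the token vectors themselves."""
    dim = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return outputs

def add_positions(tokens, position_table):
    """Add the embedding of slot i to whatever token sits in slot i."""
    return [[t + p for t, p in zip(tok, position_table[i])] for i, tok in enumerate(tokens)]

def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
positions = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
perm = [2, 0, 1]                      # slot j of the shuffled input holds tokens[perm[j]]
shuffled = [tokens[i] for i in perm]

plain = self_attention(tokens)
plain_shuffled = self_attention(shuffled)
with_pos = self_attention(add_positions(tokens, positions))
with_pos_shuffled = self_attention(add_positions(shuffled, positions))

# Without positions, each token keeps the same output wherever it is placed.
same_without_positions = all(
    max_abs_diff(plain_shuffled[j], plain[i]) < 1e-9 for j, i in enumerate(perm)
)
# With positions, moving a token to another slot changes its output.
changed_with_positions = any(
    max_abs_diff(with_pos_shuffled[j], with_pos[i]) > 1e-6 for j, i in enumerate(perm)
)
```

In this toy setting the per-token outputs match up to floating-point error when no position information is added, which is why an explicit position embedding is needed whenever order matters.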
The original BERT uses two pre-training tasks. In the masked language modeling (MLM) task, a portion of the input words is randomly masked out. The masked words are replaced by a special token "[MASK]". The task is to predict the masked words from the observed context. The context consists of the unmasked elements, which provide useful cues for predicting the masked words.
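The masking step described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: the function name `mask_words` and the default `mask_prob=0.15` are assumptions (the text only says a portion of the words is masked), and every selected word is replaced by "[MASK]" with no other substitution scheme.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_words(words, mask_prob=0.15, seed=None):
    """Replace each word with [MASK] independently with probability mask_prob.

    Returns the masked sequence and, per position, the original word the
    model should predict (None where the word was left visible).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for word in words:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(word)
        else:
            masked.append(word)
            targets.append(None)
    return masked, targets

sentence = "add the chopped onion to the pan".split()
masked, targets = mask_words(sentence, mask_prob=0.3, seed=0)
```

The unmasked entries of `masked` form the context, and the non-`None` entries of `targets` are the prediction targets.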
The other task, Next Sentence Prediction (NSP), models the order information between two sentences. Two sentences are sampled from a document, and NSP aims to determine whether the second sentence correctly follows the first. The two sentences are joined by the token "[SEP]", so that the model knows the input consists of separate sentences. The prediction is made from the output feature of the first token, "[CLS]". This is a binary classification problem handled by a simple sigmoid classifier. A prediction of "1" means the sentences are consecutive, with the second sentence coming right after the first.
3.2. ActBERT
3.2.1 Input Embeddings
ActBERT has four types of input elements: actions, image regions, linguistic descriptions, and special tokens. The special tokens are used to distinguish the different inputs.
Each input sequence starts with the special token "[CLS]" and ends with another token, "[SEP]". Denote the action features as $a_1, \ldots, a_L$, the region features as $r_1, \ldots, r_M$, and the sequential text features as $w_1, \ldots, w_N$. The whole sequence is then written as

$$\{[\mathrm{CLS}], a_1, \ldots, a_L, r_1, \ldots, r_M, w_1, \ldots, w_N, [\mathrm{SEP}]\}.$$

"[SEP]" is also inserted between different sentences. It can likewise be inserted between regions that come from different clips, which helps the model identify clip boundaries. For each input step, the final embedding feature consists of four different embeddings: the position embedding, the segment embedding, the token embedding, and the visual feature embedding. The authors add a few new tokens to distinguish action features from regional object features. The visual embedding is introduced to carry visual and action information. These embeddings are summed to form the final input feature of ActBERT.
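The way the four embeddings combine into one input feature can be sketched as follows. This is a hedged illustration with toy dimensions: the lazily filled `EmbeddingTable` stands in for learned embedding matrices, the ordering of actions, regions, and words inside the sequence is an assumption, and using a zero visual embedding for text and special tokens is also an assumption.

```python
import random

DIM = 8
_rng = random.Random(0)

class EmbeddingTable:
    """Toy stand-in for a learned embedding matrix: rows are created on first use."""
    def __init__(self):
        self.rows = {}

    def __call__(self, key):
        if key not in self.rows:
            self.rows[key] = [_rng.gauss(0.0, 0.02) for _ in range(DIM)]
        return self.rows[key]

token_emb, position_emb, segment_emb = EmbeddingTable(), EmbeddingTable(), EmbeddingTable()

def build_sequence(num_actions, num_regions, words):
    """[CLS], action slots, region slots, words, [SEP] (ordering is an assumption)."""
    return ["[CLS]"] + ["[ACT]"] * num_actions + ["[REGION]"] * num_regions + list(words) + ["[SEP]"]

def embed(token, position, segment, visual=None):
    """Sum token, position, segment and visual embeddings for one input step."""
    if visual is None:               # assumption: non-visual tokens get a zero visual part
        visual = [0.0] * DIM
    parts = [token_emb(token), position_emb(position), segment_emb(segment), visual]
    return [sum(values) for values in zip(*parts)]

sequence = build_sequence(num_actions=1, num_regions=2, words=["add", "salt"])
features = [embed(tok, position=i, segment=0) for i, tok in enumerate(sequence)]
```

Because all action slots share the "[ACT]" token embedding, only the visual part and the position tell two action inputs apart, which matches the role the text assigns to the token embedding.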
Position embedding
The authors add a learnable position embedding to each input in the sequence. Because self-attention does not take order into account, position encoding offers a flexible way to embed the sequence when order matters. For actions in different clips, the position embeddings differ according to the order of the video clips. For regions extracted from the same frame, the authors use the same position embedding. To distinguish regions within the same frame, they use a spatial position embedding for the different spatial locations.
Segment embedding
The authors consider multiple video clips for long-term video context modeling. Each video clip (or video segment) has a corresponding segment embedding. The elements that belong to the same video clip, namely its action inputs, regional object inputs, and linguistic descriptions, share the same segment embedding.
Token embedding
Each word is embedded using a vocabulary of 30,000 entries. In addition to the special tokens mentioned above ("[CLS]", "[MASK]", "[SEP]"), the authors introduce "[ACT]" and "[REGION]" to represent the action features and the region features extracted from video frames, respectively. Note that all action inputs share the same token embedding, which reveals the modality of the input.
Visual (action) embedding
For each video clip, the authors extract verbs from its corresponding description. For simplicity, clips without any verbs are removed. A vocabulary is then built from all the extracted verbs. Once the verb vocabulary is constructed, each video clip has one or more category labels. The authors train a 3D convolutional neural network on this dataset. The input to the 3D network is a tensor with an additional temporal dimension, and a softmax classifier is placed on top of the convolutional network. For clips with multiple labels, the authors normalize the one-hot label with the L1 norm. After training, the features after global average pooling are extracted as the action features. These features represent the actions occurring in the video clip well.
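The L1-normalized label for a clip with several verbs can be illustrated with a small sketch. The function name and the toy verb vocabulary below are hypothetical; the only behavior taken from the text is that clips without verbs are dropped and that multi-label targets are normalized to sum to one.

```python
def l1_normalized_label(verbs, vocabulary):
    """Turn the verbs found in a clip description into a softmax target.

    Each verb present gets weight 1, then the vector is divided by its
    L1 norm so the entries sum to 1.
    """
    index = {verb: i for i, verb in enumerate(vocabulary)}
    label = [0.0] * len(vocabulary)
    for verb in set(verbs):
        if verb in index:
            label[index[verb]] = 1.0
    total = sum(label)
    if total == 0:
        raise ValueError("clip has no verb from the vocabulary and would be dropped")
    return [value / total for value in label]

vocabulary = ["add", "cut", "stir", "pour"]          # toy verb vocabulary (assumption)
label = l1_normalized_label(["add", "stir"], vocabulary)
```

For a clip whose description contains "add" and "stir", each of the two verbs receives weight 0.5 and the remaining entries stay at zero.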
To obtain regional object features, the authors extract bounding boxes and the corresponding visual features from a pre-trained object detection network. Image region features provide detailed visual information for modeling visual-text relations. For each region, the visual feature embedding is the feature vector taken before the output layer of the pre-trained network. To incorporate spatial position, a 5-D vector represents the region's location. The vector consists of the four box coordinates and the fraction of the frame area covered by the region. Specifically, the vector is written as

$$\left(\frac{x_1}{W}, \frac{y_1}{H}, \frac{x_2}{W}, \frac{y_2}{H}, \frac{(x_2 - x_1)(y_2 - y_1)}{W \cdot H}\right),$$

where $W$ is the frame width, $H$ is the frame height, and $(x_1, y_1)$ and $(x_2, y_2)$ are the top-left and bottom-right coordinates, respectively. This vector is then embedded to match the dimension of the visual feature. The final regional object feature is the sum of the spatial position embedding and the object detection feature.
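The 5-D position vector described above (four box coordinates normalized by the frame size, plus the fraction of the frame area the box covers) can be computed as in this short sketch. The function name and the component order are assumptions made for illustration.

```python
def spatial_vector(x1, y1, x2, y2, frame_width, frame_height):
    """Normalized box corners plus the fraction of the frame covered by the box."""
    area_fraction = ((x2 - x1) * (y2 - y1)) / (frame_width * frame_height)
    return (
        x1 / frame_width,
        y1 / frame_height,
        x2 / frame_width,
        y2 / frame_height,
        area_fraction,
    )

# A box covering the top-left quarter of a 100 x 200 frame.
vector = spatial_vector(0, 0, 50, 100, frame_width=100, frame_height=200)
```

All five components lie in [0, 1] for boxes inside the frame, so regions from frames of different resolution end up on a comparable scale before the vector is embedded.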
3.2.2 Tangled Transformer
The authors design a Tangled Transformer (TNT) to better encode the three sources of information: action features, regional object features, and linguistic features.
Instead of treating visual and textual features equally with a single Transformer, the tangled Transformer consists of three Transformers, one for each of the three feature sources. To enhance the interactions between visual and linguistic features, the authors inject visual information into the language Transformer and incorporate linguistic information into the visual Transformers. With cross-modal interactions, TNT can dynamically select judicious cues for target prediction. Denote the intermediate representations at Transformer block $l$ as $h^l = \{h^l_w, h^l_a, h^l_r\}$, where $h^l_w$, $h^l_a$, and $h^l_r$ collect the language, action, and region features and are processed by the w-transformer, a-transformer, and r-transformer, respectively, as shown in the figure above. Besides the standard multi-head attention that encodes features from the same modality, the authors use two additional multi-head attention blocks to enhance the interactions between the Transformers.

TNT differs from the co-attentional transformer in several ways. First, the co-attentional transformer block simply passes keys and values from one modality to the other modality's attention block, without further pre-processing. Second, the co-attentional transformer treats the two modalities equally, whereas the TNT block uses a global cue to guide the selection of local cues from the linguistic and visual features. Third, in the co-attentional transformer, keys and values from the other modality replace the original keys and values, whereas TNT stacks them with the original ones. In this way, both linguistic and visual features are incorporated during Transformer encoding.
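The idea of action-guided selection followed by key-value stacking can be sketched in a heavily simplified form. The code below is an illustration under stated assumptions, not the authors' implementation: it uses a single attention head with no learned projections, and it omits residual connections, layer normalization, and feed-forward layers. The function names (`attend`, `tangled_step`) are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs

def tangled_step(h_w, h_a, h_r):
    """One simplified tangled block over language (w), action (a), region (r) features."""
    # Two extra attention blocks: the global action features act as queries
    # that pick out relevant local cues from each of the other two streams.
    c_w = attend(h_a, h_w, h_w)      # language cues selected by the actions
    c_r = attend(h_a, h_r, h_r)      # region cues selected by the actions
    # The selected cues are stacked with each stream's own keys and values
    # (appended, not substituted), then ordinary attention is applied.
    new_w = attend(h_w, h_w + c_r, h_w + c_r)   # visual information enters the language stream
    new_r = attend(h_r, h_r + c_w, h_r + c_w)   # language information enters the region stream
    new_a = attend(h_a, h_a, h_a)               # action stream attends to itself
    return new_w, new_a, new_r

rng = random.Random(0)
def random_features(count, dim=4):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(count)]

h_w, h_a, h_r = random_features(5), random_features(2), random_features(3)
new_w, new_a, new_r = tangled_step(h_w, h_a, h_r)
```

Appending `c_r` and `c_w` to the existing keys and values, instead of substituting them, is what distinguishes the stacking behavior described in the text from the replacement used by the co-attentional transformer.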
3.2.3 ActBERT Training
The figure above illustrates the pre-training tasks of the method.
Masked Language Modeling with Global and Local Visual Cues
The authors extend BERT's masked language modeling (MLM) task to this setting. They use visual cues from local regional objects and global actions to uncover the relations between visual and linguistic entities. Each word in the input sentence is randomly masked with a fixed probability. The task forces the model to learn from the contextual description while extracting relevant visual features to aid prediction. When a verb is masked, the model should exploit the action features for a more accurate prediction. When the description of an object is masked, local regional features can provide more contextual information. A strong model therefore needs to align visual and linguistic inputs both locally and globally. A softmax classifier is then applied on top of the output features.
Masked Action Classification
Similarly, in masked action classification, the action features are masked. The task is to predict the masked action label from the linguistic features and the object features. Explicit action prediction is beneficial in two ways. First, action sequence cues can be exploited over the long term. For example, for a video with the action sequence "get into", "rotate", "add", this task can better exploit the temporal information about performing the instructional task. Second, regional objects and linguistic text are leveraged for better cross-modal modeling. Note that in masked action classification, the goal is to predict the categorical label of the masked action feature. This task strengthens the action recognition capability of the pre-trained model, which can further generalize to many downstream tasks such as video question answering.
Masked Object Classification
In masked object classification, the regional object features are randomly masked. The model predicts a distribution over a fixed vocabulary for each masked image region. The target distribution of the masked region is computed as the softmax activation obtained by forwarding the region through the same pre-trained detection model used in the feature extraction stage. The KL divergence between the two distributions is minimized.
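The objective for a single masked region can be sketched as a KL divergence between two softmax distributions. The logits below are made-up numbers, and the direction KL(target || predicted) is an assumption, since the text only states that the divergence between the two distributions is minimized.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions over the same classes."""
    return sum(
        t * (math.log(t + eps) - math.log(p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

detector_logits = [2.0, 0.5, -1.0]   # hypothetical detector output for the masked region
model_logits = [1.0, 1.0, 0.0]       # hypothetical model prediction for the same region

target = softmax(detector_logits)
predicted = softmax(model_logits)
loss = kl_divergence(target, predicted)
```

The loss is zero when the predicted distribution equals the detector's distribution and positive otherwise, so minimizing it pulls the model's prediction toward the detector's soft labels.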
Cross-modal matching
Similar to the NSP task, the authors apply a linear layer on top of the output of the first token, "[CLS]". It is followed by a sigmoid classifier that yields a relevance score between the linguistic sentence and the visual features. A high score indicates that the text describes the video clip well. The model is optimized with a binary cross-entropy loss. To train this cross-modal matching task, the authors sample negative video-text pairs from the unlabeled dataset.
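The matching head and its loss can be sketched as follows. This is a minimal illustration with hypothetical names: `weights` and `bias` stand for the linear layer on the "[CLS]" output, and the negative sampling shown here (pairing each clip with the text of the next clip) is only one simple possibility, since the text does not specify how negatives are drawn.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matching_score(cls_feature, weights, bias):
    """Linear layer on the [CLS] output followed by a sigmoid."""
    return sigmoid(sum(w * f for w, f in zip(weights, cls_feature)) + bias)

def binary_cross_entropy(score, label, eps=1e-12):
    """label is 1 for a matched video-text pair and 0 for a negative pair."""
    return -(label * math.log(score + eps) + (1 - label) * math.log(1 - score + eps))

def make_negative_pairs(clips, texts):
    """Pair each clip with the text of the next clip (assumed sampling scheme)."""
    n = len(clips)
    if n < 2:
        raise ValueError("need at least two clips to form negative pairs")
    return [(clips[i], texts[(i + 1) % n]) for i in range(n)]

negatives = make_negative_pairs(["v1", "v2", "v3"], ["t1", "t2", "t3"])
score = matching_score([0.0, 0.0], weights=[1.0, 1.0], bias=0.0)
```

A positive pair is trained with label 1 and a negative pair with label 0, so a confident correct score gives a small loss and a confident wrong score gives a large one.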
Video captioning
The table above shows the results of the method on the video captioning task on the YouCook2 dataset, indicating that ActBERT generalizes well to captioning.
Action segmentation
The table above shows the results of the method on the action segmentation task on the COIN dataset. ActBERT improves over the baseline by about 20%.
Action step localization
The table above shows the results of the method on the action step localization task on the CrossTask dataset.
Text-video clip retrieval
The table above shows the results of the method on the text-video retrieval task on the YouCook2 dataset.
Video question answering
The table above shows the results of the method on the video question answering tasks on the MSRVTT and LSMDC datasets. The method clearly outperforms the other baseline methods.
In this paper, the authors introduce ActBERT for joint video-text modeling in a self-supervised way. The model directly models both global and local visual cues for fine-grained learning of visual and linguistic relations. ActBERT takes three sources of information as input: global actions, local regional objects, and linguistic descriptions. The TaNgled Transformer further enhances the communication between the three sources. Quantitative results on five video-text benchmarks demonstrate the effectiveness of ActBERT.