
[Natural Language Processing] [Multimodal] ALBEF: Vision-Language Representation Learning with Momentum Distillation

2022-06-11 23:08:00 BQW_

ALBEF: Vision-Language Representation Learning with Momentum Distillation
《Align before Fuse: Vision and Language Representation Learning with Momentum Distillation》

Paper: https://arxiv.org/pdf/2107.07651.pdf

Related posts:
[Natural Language Processing] [Multimodal] CLIP: Learning Transferable Visual Models from Natural Language Supervision
[Natural Language Processing] [Multimodal] ViT-BERT: Pre-training a unified foundation model on non image-text-pair data
[Natural Language Processing] [Multimodal] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[Natural Language Processing] [Multimodal] FLAVA: A Foundational Language And Vision Alignment Model
[Natural Language Processing] [Multimodal] SimVLM: Simple Visual Language Model Pre-training with Weak Supervision
[Natural Language Processing] [Multimodal] UniT: Multimodal Multitask Learning with a Unified Transformer
[Natural Language Processing] [Multimodal] Product1M: Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pre-training
[Natural Language Processing] [Multimodal] ALBEF: Vision-Language Representation Learning with Momentum Distillation

One. Introduction

Vision-and-Language Pre-training (VLP) aims to learn multimodal representations from large-scale image-text pairs in order to improve downstream vision-and-language (V+L) tasks. Most existing VLP methods rely on a pre-trained object detector to extract region-based image features, and use a multimodal encoder to fuse the image features with word features. The multimodal encoder is trained on tasks that require jointly understanding images and text, such as masked language modeling (MLM) and image-text matching (ITM).

Although effective, these VLP frameworks suffer from several key limitations: (1) the image features and the word embeddings live in their own spaces, which makes it challenging for the multimodal encoder to learn to model their interactions; (2) the object detector is both annotation-expensive and compute-expensive, since it requires manually labeled bounding boxes during pre-training and high-resolution images during inference; (3) widely used image-text datasets are collected from the web and are inherently noisy, so pre-training objectives such as MLM may overfit to the noisy text and degrade the model's generalization.

The authors propose ALBEF (ALign BEfore Fuse), a new VLP framework that addresses these limitations. First, a detector-free image encoder and a text encoder independently encode the image and the text. A multimodal encoder then fuses the image features with the text features through cross-modal attention. The authors introduce an intermediate image-text contrastive (ITC) loss on the representations of the unimodal encoders, which serves three purposes: (1) it aligns the image features with the text features, making cross-modal learning easier for the multimodal encoder; (2) it improves the unimodal encoders so that they better capture the semantics of images and texts; (3) it learns a common low-dimensional embedding space for images and texts, which allows the image-text matching objective to find more informative samples through contrastive hard negative mining.

To improve learning under noisy supervision, the authors propose Momentum Distillation (MoD), a simple method that enables the model to exploit larger, noisier datasets. During training, a momentum version of the model is maintained by taking the moving average of the model parameters, and the momentum model is used to generate pseudo-targets as additional supervision. With MoD, the model is no longer penalized for producing reasonable outputs that differ from the web annotations. MoD improves not only pre-training but also downstream tasks.

The authors also provide a theoretical analysis of ALBEF from the perspective of maximizing mutual information. Specifically, ITC and MLM maximize a lower bound on the mutual information between different views of an image-text pair, where the views are generated by taking partial information from each pair. From this perspective, momentum distillation can be interpreted as generating new views with the same semantics. Therefore, ALBEF learns vision-language representations that are invariant to semantic-preserving transformations.

The authors demonstrate the effectiveness of ALBEF on a variety of downstream V+L tasks, including image-text retrieval, visual question answering, visual reasoning, visual entailment, and weakly supervised visual grounding. ALBEF achieves substantial improvements over existing state-of-the-art methods. On image-text retrieval, it outperforms methods pre-trained on much larger datasets (CLIP and ALIGN). On VQA and NLVR, it achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art method VILLA, while also offering faster inference. In addition, the authors use Grad-CAM to analyze ALBEF both qualitatively and quantitatively.

Two. ALBEF Pre-training

[Figure: ALBEF model architecture and pre-training objectives]

1. Model architecture

As shown in the figure above, ALBEF contains an image encoder, a text encoder, and a multimodal encoder. A 12-layer ViT-B/16 is used as the image encoder, initialized with weights pre-trained on ImageNet-1K. An input image $I$ is encoded into a sequence of embeddings $\{\textbf{v}_{cls},\textbf{v}_1,\dots,\textbf{v}_N\}$, where $\textbf{v}_{cls}$ is the embedding of the [CLS] token. Both the text encoder and the multimodal encoder are 6-layer Transformers: the text encoder is initialized with the first 6 layers of $\text{BERT}_{base}$, and the multimodal encoder with the last 6 layers of $\text{BERT}_{base}$. The text encoder maps the input text $T$ to a sequence of embeddings $\{\textbf{w}_{cls},\textbf{w}_1,\dots,\textbf{w}_N\}$, which is fed to the multimodal encoder. In each layer of the multimodal encoder, the image features are fused with the text features through cross-attention. A minimal sketch of this layout is given below.
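The following is a minimal PyTorch sketch of the encoder layout described above, not the authors' implementation; the generic Transformer layers standing in for ViT-B/16 and BERT_base, the module names, and the dimensions are illustrative assumptions.

```python
# Minimal sketch of the ALBEF encoder layout; names and dimensions are illustrative.
import torch
from torch import nn

class CrossModalLayer(nn.Module):
    """Self-attention over text, cross-attention to image features, then a feed-forward block."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, text, image):
        text = self.norm1(text + self.self_attn(text, text, text)[0])
        text = self.norm2(text + self.cross_attn(text, image, image)[0])
        return self.norm3(text + self.ffn(text))

class ALBEFSketch(nn.Module):
    def __init__(self, d=768, embed_dim=256):
        super().__init__()
        # stand-ins for the 12-layer ViT-B/16 and the first/last 6 BERT_base layers
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 12, 4 * d, batch_first=True), num_layers=12)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 12, 4 * d, batch_first=True), num_layers=6)
        self.multimodal_encoder = nn.ModuleList([CrossModalLayer(d) for _ in range(6)])
        # projections g_v, g_w used by the ITC loss
        self.vision_proj = nn.Linear(d, embed_dim)
        self.text_proj = nn.Linear(d, embed_dim)

    def forward(self, image_embeds, text_embeds):
        # image_embeds: [B, N+1, d] patch embeddings incl. [CLS]; text_embeds: [B, L, d]
        v = self.image_encoder(image_embeds)   # {v_cls, v_1, ..., v_N}
        w = self.text_encoder(text_embeds)     # {w_cls, w_1, ..., w_N}
        m = w
        for layer in self.multimodal_encoder:  # fuse text with image via cross-attention
            m = layer(m, v)
        return v, w, m
```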

2. Pre-training objectives

ALBEF is pre-trained with three objectives: image-text contrastive learning (ITC) on the unimodal encoders, masked language modeling (MLM) on the multimodal encoder, and image-text matching (ITM). In addition, ITM is improved with online contrastive hard negative mining.

2.1 Image-text contrastive learning (ITC)

This loss aims to learn better unimodal representations before fusion. It learns a similarity function $s=g_v(\textbf{v}_{cls})^\top g_w(\textbf{w}_{cls})$ such that matched image-text pairs receive higher similarity scores. $g_v$ and $g_w$ are linear projections that map the [CLS] embeddings to normalized low-dimensional representations. Inspired by MoCo, two queues are maintained to store the most recent $M$ image-text representations from the momentum unimodal encoders. The normalized features from the momentum encoders are denoted $g_v'(\textbf{v}_{cls}')$ and $g_w'(\textbf{w}_{cls}')$. The similarities are defined as $s(I,T)=g_v(\textbf{v}_{cls})^\top g_w'(\textbf{w}_{cls}')$ and $s(T,I)=g_w(\textbf{w}_{cls})^\top g_v'(\textbf{v}_{cls}')$.

For each image and each text, the softmax-normalized image-to-text and text-to-image similarities are computed as:

$$p_m^{i2t}(I)=\frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M \exp(s(I,T_m)/\tau)},\quad p_m^{t2i}(T)=\frac{\exp(s(T,I_m)/\tau)}{\sum_{m=1}^M \exp(s(T,I_m)/\tau)} \tag{1}$$

where $\tau$ is a learnable temperature parameter. Let $\textbf{y}^{i2t}(I)$ and $\textbf{y}^{t2i}(T)$ denote the ground-truth one-hot similarity, in which negative pairs have probability 0 and the positive pair has probability 1. The image-text contrastive loss is defined as the cross-entropy $H$ between $\textbf{p}$ and $\textbf{y}$:

$$\mathcal{L}_{itc}=\frac{1}{2}\mathbb{E}_{(I,T)\sim D}\big[H(\textbf{y}^{i2t}(I),\textbf{p}^{i2t}(I))+H(\textbf{y}^{t2i}(T),\textbf{p}^{t2i}(T))\big] \tag{2}$$
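Below is a minimal sketch, assuming pre-normalized [CLS] features and pre-filled momentum queues, of how Equations (1)-(2) can be computed; tensor shapes and names are illustrative.

```python
# Sketch of the ITC loss (Eq. 1-2) with momentum queues; illustrative only.
import torch
import torch.nn.functional as F

def itc_loss(img_feat, txt_feat, img_feat_m, txt_feat_m,
             image_queue, text_queue, temp):
    """
    img_feat, txt_feat:      [B, D] normalized features from the base encoders (g_v, g_w)
    img_feat_m, txt_feat_m:  [B, D] normalized features from the momentum encoders
    image_queue, text_queue: [D, M] queued momentum features
    """
    # candidates = current momentum batch + queue
    txt_all = torch.cat([txt_feat_m.t(), text_queue], dim=1)   # [D, B+M]
    img_all = torch.cat([img_feat_m.t(), image_queue], dim=1)  # [D, B+M]

    sim_i2t = img_feat @ txt_all / temp    # s(I, T_m) / tau
    sim_t2i = txt_feat @ img_all / temp    # s(T, I_m) / tau

    # one-hot targets: the positive is the paired sample at the same batch index
    targets = torch.arange(img_feat.size(0), device=img_feat.device)
    return (F.cross_entropy(sim_i2t, targets) + F.cross_entropy(sim_t2i, targets)) / 2
```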

2.2 Masked language modeling (MLM)

MLM uses both the image and the contextual text to predict masked words. Input tokens are randomly masked with probability 15% and replaced with the special [MASK] token. Let $\hat{T}$ denote the masked text, and let $\textbf{p}^{msk}(I,\hat{T})$ denote the model's predicted probability for a masked token. MLM minimizes the cross-entropy loss:

$$\mathcal{L}_{mlm}=\mathbb{E}_{(I,\hat{T})\sim D}\, H(\textbf{y}^{msk},\textbf{p}^{msk}(I,\hat{T})) \tag{3}$$

where $\textbf{y}^{msk}$ is a one-hot vocabulary distribution in which the ground-truth token has probability 1.
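A brief sketch of the MLM objective in Equation (3), assuming a standard linear prediction head over the vocabulary and label tensors that mark unmasked positions with -100:

```python
# Sketch of the MLM loss (Eq. 3) on top of the multimodal encoder; illustrative.
import torch
import torch.nn.functional as F

def mlm_loss(mlm_head, multimodal_hidden, labels):
    """
    mlm_head:          nn.Linear(d, vocab_size) predicting tokens from hidden states
    multimodal_hidden: [B, L, d] outputs of the multimodal encoder on (image, masked text)
    labels:            [B, L] ground-truth token ids, with -100 at unmasked positions
    """
    logits = mlm_head(multimodal_hidden)                      # [B, L, vocab]
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```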

2.3 Image-text matching (ITM)

ITM predicts whether an image-text pair is matched (positive) or not matched (negative). The [CLS] output embedding of the multimodal encoder is used as the joint representation of the image-text pair, followed by a fully connected layer with softmax that predicts the two-class probability $p^{itm}$. The ITM loss is:

$$\mathcal{L}_{itm}=\mathbb{E}_{(I,T)\sim D}\, H(\textbf{y}^{itm},\textbf{p}^{itm}(I,T)) \tag{4}$$

where $\textbf{y}^{itm}$ is a 2-dimensional one-hot vector.

In addition, the authors propose a hard negative sampling strategy for the ITM task. An image-text pair is considered a hard negative if it shares similar semantics but differs in fine-grained details. The contrastive similarity from Equation (1) is used to find in-batch hard negatives. For each image in a batch, one negative text is sampled from the same batch according to the contrastive similarity distribution, so that texts more similar to the image are more likely to be sampled; likewise, one hard negative image is sampled for each text. A sketch of this sampling step is shown below.
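The sketch assumes the in-batch similarity matrices from Equation (1) are available; all names are illustrative.

```python
# Sketch of in-batch hard negative sampling for ITM, driven by the ITC similarities.
import torch

@torch.no_grad()
def sample_hard_negatives(sim_i2t, sim_t2i):
    """
    sim_i2t, sim_t2i: [B, B] in-batch similarity matrices s(I, T) and s(T, I)
    Returns indices of one hard negative text per image and one hard negative image per text.
    """
    B = sim_i2t.size(0)
    weights_i2t = torch.softmax(sim_i2t, dim=1)
    weights_t2i = torch.softmax(sim_t2i, dim=1)
    # a pair cannot be its own negative
    mask = torch.eye(B, dtype=torch.bool, device=sim_i2t.device)
    weights_i2t = weights_i2t.masked_fill(mask, 0)
    weights_t2i = weights_t2i.masked_fill(mask, 0)
    neg_text_idx = torch.multinomial(weights_i2t, 1).squeeze(1)   # hard negative text per image
    neg_image_idx = torch.multinomial(weights_t2i, 1).squeeze(1)  # hard negative image per text
    return neg_text_idx, neg_image_idx
```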

The full pre-training objective of ALBEF is:

$$\mathcal{L}=\mathcal{L}_{itc}+\mathcal{L}_{mlm}+\mathcal{L}_{itm} \tag{5}$$

3. Momentum distillation

The image-text pairs used for pre-training are mostly collected from the web and are noisy. Positive pairs are often only weakly correlated: the text may contain words unrelated to the image, or the image may contain entities not described in the text. For ITC learning, a negative text may also match the content of the image. For MLM, there may exist words other than the annotation that describe the image equally well. However, the one-hot labels of ITC and MLM penalize all negative predictions regardless of their correctness.

To address this problem, the authors propose to learn from pseudo-targets generated by a momentum model. The momentum model is a continuously evolving teacher consisting of exponential-moving-average versions of the unimodal and multimodal encoders. During training, the base model is trained so that its predictions match those of the momentum model. Specifically, for ITC, the features of the momentum unimodal encoders are first used to compute the image-text similarities $s'(I,T)=g_v'(\textbf{v}_{cls}')^\top g_w'(\textbf{w}_{cls}')$ and $s'(T,I)=g_w'(\textbf{w}_{cls}')^\top g_v'(\textbf{v}_{cls}')$. Then, pseudo-targets $\textbf{q}^{i2t}$ and $\textbf{q}^{t2i}$ are computed by replacing $s$ with $s'$ in Equation (1). The $\text{ITC}_{MoD}$ loss is defined as:

$$\mathcal{L}_{itc}^{mod}=(1-\alpha)\mathcal{L}_{itc}+\frac{\alpha}{2}\mathbb{E}_{(I,T)\sim D}\big[\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))+\text{KL}(\textbf{q}^{t2i}(T)\parallel\textbf{p}^{t2i}(T))\big] \tag{6}$$

Similarly, for MLM, let $\textbf{q}^{msk}(I,\hat{T})$ denote the momentum model's predicted probability for the masked token; the $\text{MLM}_{MoD}$ loss is:

$$\mathcal{L}_{mlm}^{mod}=(1-\alpha)\mathcal{L}_{mlm}+\alpha\,\mathbb{E}_{(I,\hat{T})\sim D}\,\text{KL}(\textbf{q}^{msk}(I,\hat{T})\parallel\textbf{p}^{msk}(I,\hat{T})) \tag{7}$$
The figure above shows the top-5 candidates of the pseudo-targets, which effectively capture words and texts relevant to the image.

The authors also apply MoD to downstream tasks. The final loss for each task is a weighted combination of the original task loss and the KL divergence between the model's predictions and the pseudo-targets. For simplicity, the weight is set to $\alpha=0.4$ for all pre-training and downstream tasks.
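The following sketch illustrates the EMA update of the momentum model and the image-to-text half of Equation (6). It uses a soft cross-entropy against the pseudo-targets, which differs from the KL term only by a constant (the entropy of $\textbf{q}$, which does not depend on the base model). Shapes and names are assumptions.

```python
# Sketch of the momentum (EMA) model update and the soft-target ITC loss of Eq. 6.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    """Keep the momentum model as an exponential moving average of the base model."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def itc_mod_loss(sim_i2t, sim_i2t_m, targets, alpha=0.4):
    """
    sim_i2t:   [B, B+M] similarities s(I, T_m)/tau from the base model
    sim_i2t_m: [B, B+M] similarities s'(I, T_m)/tau from the momentum model
    targets:   [B] indices of the positive pairs (the one-hot labels)
    """
    hard_loss = F.cross_entropy(sim_i2t, targets)
    pseudo_targets = F.softmax(sim_i2t_m, dim=1)               # q^{i2t}
    soft_loss = -(pseudo_targets * F.log_softmax(sim_i2t, dim=1)).sum(1).mean()
    return (1 - alpha) * hard_loss + alpha * soft_loss         # i2t half of Eq. 6
```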

4. Pre-training datasets

Following UNITER, two web datasets (Conceptual Captions, SBU Captions) and two in-domain datasets (COCO, Visual Genome) are used. The total number of unique images is 4M, and the number of image-text pairs is 5.1M. To show that the method scales to larger and noisier web data, the authors also include the noisier Conceptual 12M dataset, which increases the total number of images to 14.1M.

5. Implementation details

The model consists of a $\text{BERT}_{base}$ with 123.7M parameters and a ViT-B/16 with 85.8M parameters. It is pre-trained for 30 epochs on 8 NVIDIA A100 GPUs with a batch size of 512, using the AdamW optimizer with a weight decay of 0.02. The learning rate is warmed up to $1e^{-4}$ over the first 1000 iterations and then decayed to $1e^{-5}$ following a cosine schedule. During pre-training, random image crops of resolution $256\times 256$ are used as input, together with RandAugment. During fine-tuning, the image resolution is increased to $384\times 384$ and the positional encodings of the image patches are interpolated. The momentum coefficient for updating the momentum model is set to 0.995, and the queue size for image-text contrastive learning is set to 65,536. The distillation weight $\alpha$ is linearly ramped from 0 to 0.4 within the first epoch.
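A small sketch of the optimizer, the warmup-plus-cosine learning-rate schedule, and the linear ramp of $\alpha$ described above; the number of steps per epoch is a placeholder assumption.

```python
# Sketch of the training schedule: AdamW, linear warmup to 1e-4, cosine decay to 1e-5,
# and a linear ramp of the distillation weight alpha over the first epoch; illustrative.
import math
import torch

def make_optimizer(model):
    return torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.02)

def lr_at_step(step, warmup_steps=1000, total_steps=30 * 10000,
               peak_lr=1e-4, min_lr=1e-5):
    """Linear warmup followed by cosine decay; total_steps is an assumed placeholder."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))

def alpha_at_step(step, steps_per_epoch=10000, alpha_max=0.4):
    """Distillation weight ramps from 0 to 0.4 during the first epoch, then stays at 0.4."""
    return alpha_max * min(1.0, step / steps_per_epoch)
```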

Three. A Mutual Information Maximization Perspective

This section provides an alternative perspective on ALBEF and shows that it maximizes a lower bound on the mutual information between different views of an image-text pair. ITC, MLM, and MoD can be interpreted as different ways of generating views.

Formally, define two random variables $a$ and $b$ as two different views of a data point. In self-supervised learning, $a$ and $b$ are two augmentations of the same image. In vision-language representation learning, $a$ and $b$ are different variants of an image-text pair that capture the same semantics. The goal is to learn representations that are invariant to the change of view, which can be achieved by maximizing the mutual information between $a$ and $b$. In practice, a lower bound on $\text{MI}(a,b)$ is maximized by minimizing the InfoNCE loss:

$$\mathcal{L}_{NCE}=-\mathbb{E}_{p(a,b)}\Bigg[\log\frac{\exp(s(a,b))}{\sum_{\hat{b}\in\hat{B}}\exp(s(a,\hat{b}))}\Bigg] \tag{8}$$

where $s(a,b)$ is a scoring function, and $\hat{B}$ contains the positive sample $b$ and $|\hat{B}|-1$ negative samples.

The ITC loss in this paper can be rewritten as:

$$\mathcal{L}_{itc}=-\frac{1}{2}\mathbb{E}_{p(I,T)}\Big[\log\frac{\exp(s(I,T)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)}+\log\frac{\exp(s(T,I)/\tau)}{\sum_{m=1}^M\exp(s(T,I_m)/\tau)} \Big] \tag{9}$$

Minimizing $\mathcal{L}_{itc}$ can therefore be seen as maximizing a symmetric version of InfoNCE. In other words, ITC treats the two modalities as two views of an image-text pair, and trains the unimodal encoders to maximize the MI between the image view and the text view.

MLM can also be interpreted as maximizing the mutual information between a masked word and its context. Specifically, the MLM loss can be rewritten as:

$$\mathcal{L}_{mlm}=-\mathbb{E}_{p(I,\hat{T})}\Big[\log\frac{\exp(\psi(y^{msk})^\top f(I,\hat{T}))}{\sum_{y\in\mathcal{V}}\exp(\psi(y)^\top f(I,\hat{T}))}\Big] \tag{10}$$

where $\psi(y):\mathcal{V}\rightarrow\mathbb{R}^d$ is a lookup function in the output layer of the multimodal encoder that maps a word token $y$ to a vector, $\mathcal{V}$ is the full vocabulary, and $f(I,\hat{T})$ returns the final hidden state of the multimodal encoder at the position of the masked token. Hence, MLM treats an image-text pair as two views: (1) a randomly selected word token; (2) the image plus the masked context of that word.

ITC and MLM generate views by taking partial information from an image-text pair. The momentum distillation proposed in this paper can be seen as generating alternative views from the entire distribution. Taking the $\text{ITC}_{MoD}$ loss of Equation (6) as an example, minimizing $\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))$ is equivalent (up to a constant) to minimizing the following objective:

$$-\sum_{m} q_m^{i2t}(I)\log p_m^{i2t}(I)=-\sum_m \frac{\exp(s'(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s'(I,T_m)/\tau)}\log \frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)} \tag{11}$$

This maximizes the mutual information $\text{MI}(I,T_m)$ between the image $I$ and the texts $T_m$ that share similar semantics with it, since such texts receive larger $q_m^{i2t}(I)$. Likewise, $\text{ITC}_{MoD}$ maximizes $\text{MI}(I_m,T)$ for the images $I_m$ similar to the text $T$. In the same way, $\text{MLM}_{MoD}$ generates alternative views $y'\in\mathcal{V}$ for a masked word $y^{msk}$ and maximizes the MI between $y'$ and $(I,\hat{T})$. Momentum distillation can therefore be viewed as data augmentation on the original views: the momentum model generates views different from the original image-text pair and encourages the base model to learn representations that are invariant to semantic-preserving changes.

Four. Downstream V+L Tasks

[Figure: fine-tuning architectures for (a) VQA and (b) NLVR]

The pre-trained model is applied to five downstream V+L tasks. Each task and the corresponding fine-tuning strategy are described below.

1. Image-Text Retrieval

Image-text retrieval contains two subtasks: image-to-text retrieval (TR) and text-to-image retrieval (IR). ALBEF is evaluated on the Flickr30K and COCO benchmarks, and the pre-trained model is fine-tuned on the training samples of each dataset. For zero-shot retrieval on Flickr30K, the model fine-tuned on COCO is evaluated. During fine-tuning, the ITC and ITM losses are jointly optimized: ITC learns an image-text scoring function based on unimodal similarity, while ITM models the fine-grained interaction between image and text to predict a matching score. Since each image in the downstream datasets is paired with multiple texts, the ground-truth labels of ITC are changed to account for multiple positives in the queue. During inference, the feature similarity score $s_{itc}$ is first computed for all image-text pairs; the top-k candidates are then taken and re-ranked by their ITM scores $s_{itm}$. Because $k$ can be set very small, inference is much faster; a sketch of this two-stage procedure follows.
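The sketch covers image-to-text retrieval: rank candidates by the unimodal similarity $s_{itc}$, then re-rank the top-k with the ITM head. The ITM scoring helper is an assumed stand-in for a forward pass through the multimodal encoder's ITM head.

```python
# Sketch of the two-stage retrieval inference described above; illustrative only.
import torch

@torch.no_grad()
def image_to_text_retrieval(image_feat, text_feats, itm_score_fn, k=128):
    """
    image_feat:   [D] normalized feature of the query image
    text_feats:   [N, D] normalized features of all candidate texts
    itm_score_fn: callable(text_idx: LongTensor) -> [k] matching scores from the
                  multimodal encoder's ITM head (assumed helper)
    """
    s_itc = text_feats @ image_feat          # [N] coarse similarity scores
    topk = s_itc.topk(k).indices             # keep only the top-k candidates
    s_itm = itm_score_fn(topk)               # fine-grained re-ranking scores
    order = s_itm.argsort(descending=True)
    return topk[order]                       # candidate indices, best first
```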

2. Visual Entailment

Visual entailment is a fine-grained visual reasoning task that predicts whether the relationship between an image and a text is entailment, neutral, or contradiction. Following UNITER, visual entailment is treated as a three-way classification problem, and an MLP on top of the multimodal encoder's [CLS] representation predicts the class probabilities.

3. Visual Question Answering (VQA)

Given an image and a question, VQA requires the model to predict an answer. Unlike existing methods that treat VQA as a multi-answer classification problem, the authors treat it as an answer generation problem. Specifically, a 6-layer Transformer decoder is used to generate answers. As shown in panel (a) of the figure above, the autoregressive answer decoder attends to the multimodal embeddings through cross-attention; a [CLS] token serves as the decoder's start-of-sequence input, and a [SEP] token appended to the output marks the end of generation. The answer decoder is initialized with the pre-trained weights of the multimodal encoder and fine-tuned with a conditional language-modeling loss. For a fair comparison with existing methods, the decoder is constrained during inference to generate only from the 3192 candidate answers. One way to realize this constraint is sketched below.
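A hedged sketch of constrained answer generation: score each candidate answer with the decoder's language-modeling loss and keep the best one. This ranking-by-decoder-score reading is one straightforward way to restrict generation to a fixed answer list; the decoder scoring helper and all names are assumptions, not the authors' code.

```python
# Sketch: rank a fixed list of candidate answers by the decoder's conditional likelihood.
import torch

@torch.no_grad()
def rank_candidate_answers(decoder_nll_fn, multimodal_states, candidate_answer_ids):
    """
    decoder_nll_fn:       callable(answer_ids, multimodal_states) -> scalar tensor with the
                          negative log-likelihood of one answer sequence (assumed helper)
    multimodal_states:    encoder outputs for the (image, question) pair
    candidate_answer_ids: list of token-id tensors, one per candidate answer
    """
    scores = torch.stack([
        -decoder_nll_fn(ans, multimodal_states) for ans in candidate_answer_ids
    ])
    best = scores.argmax().item()            # index of the highest-likelihood answer
    return best, scores
```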

4. Natural Language for Visual Reasoning (NLVR)

NLVR requires the model to determine whether a text describes a pair of images. The authors extend the multimodal encoder so that it can reason over two images. As shown in panel (b) of the figure above, each layer of the multimodal encoder is duplicated into two consecutive Transformer blocks, each containing a self-attention layer, a cross-attention layer, and a feed-forward layer. The two blocks in each layer are initialized with the same pre-trained weights, and their cross-attention layers share the same linear projection weights. During training, the two blocks receive the embedding sets of the two images, respectively. An MLP classifier on the multimodal encoder's [CLS] representation makes the prediction; a sketch of the duplicated layer follows.
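A sketch of one reading of this duplication, reusing the illustrative CrossModalLayer from the earlier architecture sketch: the text stream passes through two consecutive blocks, each cross-attending to one image of the pair.

```python
# Sketch of a duplicated multimodal layer for NLVR; illustrative, not the authors' code.
import copy
from torch import nn

class NLVRLayer(nn.Module):
    def __init__(self, pretrained_layer):
        super().__init__()
        # both blocks start from the same pre-trained multimodal layer
        self.block1 = copy.deepcopy(pretrained_layer)
        self.block2 = copy.deepcopy(pretrained_layer)

    def forward(self, text, image1, image2):
        text = self.block1(text, image1)   # cross-attend to the first image
        text = self.block2(text, image2)   # cross-attend to the second image
        return text
```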

For NLVR, an additional pre-training step is performed to prepare the new multimodal encoder for encoding image pairs. The authors design a text-assignment (TA) task: given a pair of images and a text, the model must assign the text to the first image, the second image, or neither. TA is formulated as a three-way classification problem, and an FC layer on the [CLS] representation predicts the assignment. TA pre-training is run for 1 epoch on the 4M images.

5. Visual Grounding

The goal of visual grounding is to localize the region in an image that corresponds to a given textual description. The authors study the weakly supervised setting, where no bounding-box annotations are available. Experiments are conducted on the RefCOCO+ dataset, and the model is fine-tuned with only image-text supervision, using the same strategy as for image-text retrieval. During inference, Grad-CAM is extended to obtain heatmaps, which are used to rank the detected proposals.

Five. Experiments

(Omitted in the original post.)
