
[Natural Language Processing] [Multimodal] ALBEF: Vision-Language Representation Learning with Momentum Distillation

2022-06-11 23:08:00 BQW_

ALBEF: Vision-Language Representation Learning with Momentum Distillation
《Align before Fuse: Vision and Language Representation Learning with Momentum Distillation》

Paper: https://arxiv.org/pdf/2107.07651.pdf

Related posts:
[Natural Language Processing] [Multimodal] CLIP: Learning Transferable Visual Models from Natural Language Supervision
[Natural Language Processing] [Multimodal] ViT-BERT: Pre-training a unified foundation model on non image-text-pair data
[Natural Language Processing] [Multimodal] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[Natural Language Processing] [Multimodal] FLAVA: A Foundational Language And Vision Alignment Model
[Natural Language Processing] [Multimodal] SimVLM: Simple Visual Language Model Pre-training with Weak Supervision
[Natural Language Processing] [Multimodal] UniT: Multimodal Multitask Learning with a Unified Transformer
[Natural Language Processing] [Multimodal] Product1M: Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pre-training
[Natural Language Processing] [Multimodal] ALBEF: Vision-Language Representation Learning with Momentum Distillation

One. Introduction

Vision-and-Language Pre-training (VLP) aims to learn multimodal representations from large-scale image-text pairs in order to improve downstream vision-and-language (V+L) tasks. Most existing VLP methods rely on a pre-trained object detector to extract region-based image features, and use a multimodal encoder to fuse the image features with word features. The multimodal encoder is trained on tasks that require jointly understanding images and text, such as masked language modeling (MLM) and image-text matching (ITM).

Although effective, these VLP frameworks suffer from several key limitations: (1) the image features and the word embeddings live in their own spaces, which makes it challenging for the multimodal encoder to learn to model their interactions; (2) the object detector is both annotation-expensive and compute-expensive, since it requires manually labeled bounding boxes during pre-training and high-resolution images during inference; (3) widely used image-text datasets are collected from the web and are inherently noisy, so pre-training objectives such as MLM may overfit to the noisy text and degrade the model's generalization.

The authors propose ALBEF (ALign BEfore Fuse), a new VLP framework that addresses these limitations. First, a detector-free image encoder and a text encoder independently encode the image and the text. A multimodal encoder then fuses the image features with the text features through cross-modal attention. The authors introduce an intermediate image-text contrastive (ITC) loss on the representations of the unimodal encoders, which serves three purposes: (1) it aligns the image features with the text features, making cross-modal learning easier for the multimodal encoder; (2) it improves the unimodal encoders so that they better capture the semantics of images and texts; (3) it learns a common low-dimensional embedding space for images and texts, which allows the image-text matching objective to find more informative samples through contrastive hard negative mining.

To improve learning under noisy supervision, the authors propose Momentum Distillation (MoD), a simple method that enables the model to exploit larger, noisier datasets. During training, a momentum version of the model is maintained by taking the moving average of the model parameters, and the momentum model is used to generate pseudo-targets as additional supervision. With MoD, the model is no longer penalized for producing reasonable outputs that differ from the web annotations. MoD improves not only pre-training but also downstream tasks.

The authors also provide a theoretical analysis of ALBEF from the perspective of maximizing mutual information. Specifically, ITC and MLM maximize a lower bound on the mutual information between different views of an image-text pair, where the views are generated by taking partial information from each pair. From this perspective, momentum distillation can be interpreted as generating new views with the same semantics. Therefore, ALBEF learns vision-language representations that are invariant to semantic-preserving transformations.

The authors demonstrate the effectiveness of ALBEF on a variety of downstream V+L tasks, including image-text retrieval, visual question answering, visual reasoning, visual entailment, and weakly supervised visual grounding. ALBEF achieves substantial improvements over existing state-of-the-art methods. On image-text retrieval, it outperforms methods pre-trained on much larger datasets (CLIP and ALIGN). On VQA and NLVR, it achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art method VILLA, while also offering faster inference. In addition, the authors use Grad-CAM to analyze ALBEF both qualitatively and quantitatively.

Two. ALBEF Pre-training

[Figure: ALBEF model architecture and pre-training objectives]

1. Model architecture

As shown in the figure above, ALBEF contains an image encoder, a text encoder, and a multimodal encoder. A 12-layer ViT-B/16 is used as the image encoder, initialized with weights pre-trained on ImageNet-1K. An input image $I$ is encoded into a sequence of embeddings $\{\textbf{v}_{cls},\textbf{v}_1,\dots,\textbf{v}_N\}$, where $\textbf{v}_{cls}$ is the embedding of the [CLS] token. Both the text encoder and the multimodal encoder are 6-layer Transformers: the text encoder is initialized with the first 6 layers of $\text{BERT}_{base}$, and the multimodal encoder with the last 6 layers of $\text{BERT}_{base}$. The text encoder maps the input text $T$ to a sequence of embeddings $\{\textbf{w}_{cls},\textbf{w}_1,\dots,\textbf{w}_N\}$, which is fed to the multimodal encoder. In each layer of the multimodal encoder, the image features are fused with the text features through cross-attention. A minimal sketch of this layout is given below.
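The following is a minimal PyTorch sketch of the encoder layout described above, not the authors' implementation; the generic Transformer layers standing in for ViT-B/16 and BERT_base, the module names, and the dimensions are illustrative assumptions.

```python
# Minimal sketch of the ALBEF encoder layout; names and dimensions are illustrative.
import torch
from torch import nn

class CrossModalLayer(nn.Module):
    """Self-attention over text, cross-attention to image features, then a feed-forward block."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, text, image):
        text = self.norm1(text + self.self_attn(text, text, text)[0])
        text = self.norm2(text + self.cross_attn(text, image, image)[0])
        return self.norm3(text + self.ffn(text))

class ALBEFSketch(nn.Module):
    def __init__(self, d=768, embed_dim=256):
        super().__init__()
        # stand-ins for the 12-layer ViT-B/16 and the first/last 6 BERT_base layers
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 12, 4 * d, batch_first=True), num_layers=12)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 12, 4 * d, batch_first=True), num_layers=6)
        self.multimodal_encoder = nn.ModuleList([CrossModalLayer(d) for _ in range(6)])
        # projections g_v, g_w used by the ITC loss
        self.vision_proj = nn.Linear(d, embed_dim)
        self.text_proj = nn.Linear(d, embed_dim)

    def forward(self, image_embeds, text_embeds):
        # image_embeds: [B, N+1, d] patch embeddings incl. [CLS]; text_embeds: [B, L, d]
        v = self.image_encoder(image_embeds)   # {v_cls, v_1, ..., v_N}
        w = self.text_encoder(text_embeds)     # {w_cls, w_1, ..., w_N}
        m = w
        for layer in self.multimodal_encoder:  # fuse text with image via cross-attention
            m = layer(m, v)
        return v, w, m
```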

2. Pre-training objectives

ALBEF is pre-trained with three objectives: image-text contrastive learning (ITC) on the unimodal encoders, masked language modeling (MLM) on the multimodal encoder, and image-text matching (ITM). In addition, ITM is improved with online contrastive hard negative mining.

2.1 Image-text contrastive learning (ITC)

This loss aims to learn better unimodal representations before fusion. It learns a similarity function $s=g_v(\textbf{v}_{cls})^\top g_w(\textbf{w}_{cls})$ such that matched image-text pairs receive higher similarity scores. $g_v$ and $g_w$ are linear projections that map the [CLS] embeddings to normalized low-dimensional representations. Inspired by MoCo, two queues are maintained to store the most recent $M$ image-text representations from the momentum unimodal encoders. The normalized features from the momentum encoders are denoted $g_v'(\textbf{v}_{cls}')$ and $g_w'(\textbf{w}_{cls}')$. The similarities are defined as $s(I,T)=g_v(\textbf{v}_{cls})^\top g_w'(\textbf{w}_{cls}')$ and $s(T,I)=g_w(\textbf{w}_{cls})^\top g_v'(\textbf{v}_{cls}')$.

For each image and each text, the softmax-normalized image-to-text and text-to-image similarities are computed as:

$$p_m^{i2t}(I)=\frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M \exp(s(I,T_m)/\tau)},\quad p_m^{t2i}(T)=\frac{\exp(s(T,I_m)/\tau)}{\sum_{m=1}^M \exp(s(T,I_m)/\tau)} \tag{1}$$

where $\tau$ is a learnable temperature parameter. Let $\textbf{y}^{i2t}(I)$ and $\textbf{y}^{t2i}(T)$ denote the ground-truth one-hot similarity, in which negative pairs have probability 0 and the positive pair has probability 1. The image-text contrastive loss is defined as the cross-entropy $H$ between $\textbf{p}$ and $\textbf{y}$:

$$\mathcal{L}_{itc}=\frac{1}{2}\mathbb{E}_{(I,T)\sim D}\big[H(\textbf{y}^{i2t}(I),\textbf{p}^{i2t}(I))+H(\textbf{y}^{t2i}(T),\textbf{p}^{t2i}(T))\big] \tag{2}$$
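Below is a minimal sketch, assuming pre-normalized [CLS] features and pre-filled momentum queues, of how Equations (1)-(2) can be computed; tensor shapes and names are illustrative.

```python
# Sketch of the ITC loss (Eq. 1-2) with momentum queues; illustrative only.
import torch
import torch.nn.functional as F

def itc_loss(img_feat, txt_feat, img_feat_m, txt_feat_m,
             image_queue, text_queue, temp):
    """
    img_feat, txt_feat:      [B, D] normalized features from the base encoders (g_v, g_w)
    img_feat_m, txt_feat_m:  [B, D] normalized features from the momentum encoders
    image_queue, text_queue: [D, M] queued momentum features
    """
    # candidates = current momentum batch + queue
    txt_all = torch.cat([txt_feat_m.t(), text_queue], dim=1)   # [D, B+M]
    img_all = torch.cat([img_feat_m.t(), image_queue], dim=1)  # [D, B+M]

    sim_i2t = img_feat @ txt_all / temp    # s(I, T_m) / tau
    sim_t2i = txt_feat @ img_all / temp    # s(T, I_m) / tau

    # one-hot targets: the positive is the paired sample at the same batch index
    targets = torch.arange(img_feat.size(0), device=img_feat.device)
    return (F.cross_entropy(sim_i2t, targets) + F.cross_entropy(sim_t2i, targets)) / 2
```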

2.2 Masked language modeling (MLM)

MLM uses both the image and the contextual text to predict masked words. Input tokens are randomly masked with probability 15% and replaced with the special [MASK] token. Let $\hat{T}$ denote the masked text, and let $\textbf{p}^{msk}(I,\hat{T})$ denote the model's predicted probability for a masked token. MLM minimizes the cross-entropy loss:

$$\mathcal{L}_{mlm}=\mathbb{E}_{(I,\hat{T})\sim D}\, H(\textbf{y}^{msk},\textbf{p}^{msk}(I,\hat{T})) \tag{3}$$

where $\textbf{y}^{msk}$ is a one-hot vocabulary distribution in which the ground-truth token has probability 1.
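A brief sketch of the MLM objective in Equation (3), assuming a standard linear prediction head over the vocabulary and label tensors that mark unmasked positions with -100:

```python
# Sketch of the MLM loss (Eq. 3) on top of the multimodal encoder; illustrative.
import torch
import torch.nn.functional as F

def mlm_loss(mlm_head, multimodal_hidden, labels):
    """
    mlm_head:          nn.Linear(d, vocab_size) predicting tokens from hidden states
    multimodal_hidden: [B, L, d] outputs of the multimodal encoder on (image, masked text)
    labels:            [B, L] ground-truth token ids, with -100 at unmasked positions
    """
    logits = mlm_head(multimodal_hidden)                      # [B, L, vocab]
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```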

2.3 Image-text matching (ITM)

ITM predicts whether an image-text pair is matched (positive) or not matched (negative). The [CLS] output embedding of the multimodal encoder is used as the joint representation of the image-text pair, followed by a fully connected layer with softmax that predicts the two-class probability $p^{itm}$. The ITM loss is:

$$\mathcal{L}_{itm}=\mathbb{E}_{(I,T)\sim D}\, H(\textbf{y}^{itm},\textbf{p}^{itm}(I,T)) \tag{4}$$

where $\textbf{y}^{itm}$ is a 2-dimensional one-hot vector.

In addition, the authors propose a hard negative sampling strategy for the ITM task. An image-text pair is considered a hard negative if it shares similar semantics but differs in fine-grained details. The contrastive similarity from Equation (1) is used to find in-batch hard negatives. For each image in a batch, one negative text is sampled from the same batch according to the contrastive similarity distribution, so that texts more similar to the image are more likely to be sampled; likewise, one hard negative image is sampled for each text. A sketch of this sampling step is shown below.
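The sketch assumes the in-batch similarity matrices from Equation (1) are available; all names are illustrative.

```python
# Sketch of in-batch hard negative sampling for ITM, driven by the ITC similarities.
import torch

@torch.no_grad()
def sample_hard_negatives(sim_i2t, sim_t2i):
    """
    sim_i2t, sim_t2i: [B, B] in-batch similarity matrices s(I, T) and s(T, I)
    Returns indices of one hard negative text per image and one hard negative image per text.
    """
    B = sim_i2t.size(0)
    weights_i2t = torch.softmax(sim_i2t, dim=1)
    weights_t2i = torch.softmax(sim_t2i, dim=1)
    # a pair cannot be its own negative
    mask = torch.eye(B, dtype=torch.bool, device=sim_i2t.device)
    weights_i2t = weights_i2t.masked_fill(mask, 0)
    weights_t2i = weights_t2i.masked_fill(mask, 0)
    neg_text_idx = torch.multinomial(weights_i2t, 1).squeeze(1)   # hard negative text per image
    neg_image_idx = torch.multinomial(weights_t2i, 1).squeeze(1)  # hard negative image per text
    return neg_text_idx, neg_image_idx
```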

The full pre-training objective of ALBEF is:

$$\mathcal{L}=\mathcal{L}_{itc}+\mathcal{L}_{mlm}+\mathcal{L}_{itm} \tag{5}$$

3. Momentum distillation

The image-text pairs used for pre-training are mostly collected from the web and are noisy. Positive pairs are often only weakly correlated: the text may contain words unrelated to the image, or the image may contain entities not described in the text. For ITC learning, a negative text may also match the content of the image. For MLM, there may exist words other than the annotation that describe the image equally well. However, the one-hot labels of ITC and MLM penalize all negative predictions regardless of their correctness.

To address this problem, the authors propose to learn from pseudo-targets generated by a momentum model. The momentum model is a continuously evolving teacher consisting of exponential-moving-average versions of the unimodal and multimodal encoders. During training, the base model is trained so that its predictions match those of the momentum model. Specifically, for ITC, the features of the momentum unimodal encoders are first used to compute the image-text similarities $s'(I,T)=g_v'(\textbf{v}_{cls}')^\top g_w'(\textbf{w}_{cls}')$ and $s'(T,I)=g_w'(\textbf{w}_{cls}')^\top g_v'(\textbf{v}_{cls}')$. Then, pseudo-targets $\textbf{q}^{i2t}$ and $\textbf{q}^{t2i}$ are computed by replacing $s$ with $s'$ in Equation (1). The $\text{ITC}_{MoD}$ loss is defined as:

$$\mathcal{L}_{itc}^{mod}=(1-\alpha)\mathcal{L}_{itc}+\frac{\alpha}{2}\mathbb{E}_{(I,T)\sim D}\big[\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))+\text{KL}(\textbf{q}^{t2i}(T)\parallel\textbf{p}^{t2i}(T))\big] \tag{6}$$

Similarly, for MLM, let $\textbf{q}^{msk}(I,\hat{T})$ denote the momentum model's predicted probability for the masked token; the $\text{MLM}_{MoD}$ loss is:

$$\mathcal{L}_{mlm}^{mod}=(1-\alpha)\mathcal{L}_{mlm}+\alpha\,\mathbb{E}_{(I,\hat{T})\sim D}\,\text{KL}(\textbf{q}^{msk}(I,\hat{T})\parallel\textbf{p}^{msk}(I,\hat{T})) \tag{7}$$
The figure above shows the top-5 candidates of the pseudo-targets, which effectively capture words and texts relevant to the image.

The authors also apply MoD to downstream tasks. The final loss for each task is a weighted combination of the original task loss and the KL divergence between the model's predictions and the pseudo-targets. For simplicity, the weight is set to $\alpha=0.4$ for all pre-training and downstream tasks.
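The following sketch illustrates the EMA update of the momentum model and the image-to-text half of Equation (6). It uses a soft cross-entropy against the pseudo-targets, which differs from the KL term only by a constant (the entropy of $\textbf{q}$, which does not depend on the base model). Shapes and names are assumptions.

```python
# Sketch of the momentum (EMA) model update and the soft-target ITC loss of Eq. 6.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    """Keep the momentum model as an exponential moving average of the base model."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def itc_mod_loss(sim_i2t, sim_i2t_m, targets, alpha=0.4):
    """
    sim_i2t:   [B, B+M] similarities s(I, T_m)/tau from the base model
    sim_i2t_m: [B, B+M] similarities s'(I, T_m)/tau from the momentum model
    targets:   [B] indices of the positive pairs (the one-hot labels)
    """
    hard_loss = F.cross_entropy(sim_i2t, targets)
    pseudo_targets = F.softmax(sim_i2t_m, dim=1)               # q^{i2t}
    soft_loss = -(pseudo_targets * F.log_softmax(sim_i2t, dim=1)).sum(1).mean()
    return (1 - alpha) * hard_loss + alpha * soft_loss         # i2t half of Eq. 6
```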

4. Pre-training datasets

Following UNITER, two web datasets (Conceptual Captions, SBU Captions) and two in-domain datasets (COCO, Visual Genome) are used. The total number of unique images is 4M, and the number of image-text pairs is 5.1M. To show that the method scales to larger and noisier web data, the authors also include the noisier Conceptual 12M dataset, which increases the total number of images to 14.1M.

5. Implementation details

The model consists of a $\text{BERT}_{base}$ with 123.7M parameters and a ViT-B/16 with 85.8M parameters. It is pre-trained for 30 epochs on 8 NVIDIA A100 GPUs with a batch size of 512, using the AdamW optimizer with a weight decay of 0.02. The learning rate is warmed up to $1e^{-4}$ over the first 1000 iterations and then decayed to $1e^{-5}$ following a cosine schedule. During pre-training, random image crops of resolution $256\times 256$ are used as input, together with RandAugment. During fine-tuning, the image resolution is increased to $384\times 384$ and the positional encodings of the image patches are interpolated. The momentum coefficient for updating the momentum model is set to 0.995, and the queue size for image-text contrastive learning is set to 65,536. The distillation weight $\alpha$ is linearly ramped from 0 to 0.4 within the first epoch.
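A small sketch of the optimizer, the warmup-plus-cosine learning-rate schedule, and the linear ramp of $\alpha$ described above; the number of steps per epoch is a placeholder assumption.

```python
# Sketch of the training schedule: AdamW, linear warmup to 1e-4, cosine decay to 1e-5,
# and a linear ramp of the distillation weight alpha over the first epoch; illustrative.
import math
import torch

def make_optimizer(model):
    return torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.02)

def lr_at_step(step, warmup_steps=1000, total_steps=30 * 10000,
               peak_lr=1e-4, min_lr=1e-5):
    """Linear warmup followed by cosine decay; total_steps is an assumed placeholder."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))

def alpha_at_step(step, steps_per_epoch=10000, alpha_max=0.4):
    """Distillation weight ramps from 0 to 0.4 during the first epoch, then stays at 0.4."""
    return alpha_max * min(1.0, step / steps_per_epoch)
```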

Three. A Mutual Information Maximization Perspective

This section provides an alternative perspective on ALBEF and shows that it maximizes a lower bound on the mutual information between different views of an image-text pair. ITC, MLM, and MoD can be interpreted as different ways of generating views.

Formally, define two random variables $a$ and $b$ as two different views of a data point. In self-supervised learning, $a$ and $b$ are two augmentations of the same image. In vision-language representation learning, $a$ and $b$ are different variants of an image-text pair that capture the same semantics. The goal is to learn representations that are invariant to the change of view, which can be achieved by maximizing the mutual information between $a$ and $b$. In practice, a lower bound on $\text{MI}(a,b)$ is maximized by minimizing the InfoNCE loss:

$$\mathcal{L}_{NCE}=-\mathbb{E}_{p(a,b)}\Bigg[\log\frac{\exp(s(a,b))}{\sum_{\hat{b}\in\hat{B}}\exp(s(a,\hat{b}))}\Bigg] \tag{8}$$

where $s(a,b)$ is a scoring function, and $\hat{B}$ contains the positive sample $b$ and $|\hat{B}|-1$ negative samples.

The ITC loss in this paper can be rewritten as:

$$\mathcal{L}_{itc}=-\frac{1}{2}\mathbb{E}_{p(I,T)}\Big[\log\frac{\exp(s(I,T)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)}+\log\frac{\exp(s(T,I)/\tau)}{\sum_{m=1}^M\exp(s(T,I_m)/\tau)} \Big] \tag{9}$$

Minimizing $\mathcal{L}_{itc}$ can therefore be seen as maximizing a symmetric version of InfoNCE. In other words, ITC treats the two modalities as two views of an image-text pair, and trains the unimodal encoders to maximize the MI between the image view and the text view.

MLM can also be interpreted as maximizing the mutual information between a masked word and its context. Specifically, the MLM loss can be rewritten as:

$$\mathcal{L}_{mlm}=-\mathbb{E}_{p(I,\hat{T})}\Big[\log\frac{\exp(\psi(y^{msk})^\top f(I,\hat{T}))}{\sum_{y\in\mathcal{V}}\exp(\psi(y)^\top f(I,\hat{T}))}\Big] \tag{10}$$

where $\psi(y):\mathcal{V}\rightarrow\mathbb{R}^d$ is a lookup function in the output layer of the multimodal encoder that maps a word token $y$ to a vector, $\mathcal{V}$ is the full vocabulary, and $f(I,\hat{T})$ returns the final hidden state of the multimodal encoder at the position of the masked token. Hence, MLM treats an image-text pair as two views: (1) a randomly selected word token; (2) the image plus the masked context of that word.

ITC and MLM generate views by taking partial information from an image-text pair. The momentum distillation proposed in this paper can be seen as generating alternative views from the entire distribution. Taking the $\text{ITC}_{MoD}$ loss of Equation (6) as an example, minimizing $\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))$ is equivalent (up to a constant) to minimizing the following objective:

$$-\sum_{m} q_m^{i2t}(I)\log p_m^{i2t}(I)=-\sum_m \frac{\exp(s'(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s'(I,T_m)/\tau)}\log \frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)} \tag{11}$$

This maximizes the mutual information $\text{MI}(I,T_m)$ between the image $I$ and the texts $T_m$ that share similar semantics with it, since such texts receive larger $q_m^{i2t}(I)$. Likewise, $\text{ITC}_{MoD}$ maximizes $\text{MI}(I_m,T)$ for the images $I_m$ similar to the text $T$. In the same way, $\text{MLM}_{MoD}$ generates alternative views $y'\in\mathcal{V}$ for a masked word $y^{msk}$ and maximizes the MI between $y'$ and $(I,\hat{T})$. Momentum distillation can therefore be viewed as data augmentation on the original views: the momentum model generates views different from the original image-text pair and encourages the base model to learn representations that are invariant to semantic-preserving changes.

Four. Downstream V+L Tasks

[Figure: fine-tuning architectures for (a) VQA and (b) NLVR]

The pre-trained model is applied to five downstream V+L tasks. Each task and the corresponding fine-tuning strategy are described below.

1. Image-Text Retrieval

Image-text retrieval contains two subtasks: image-to-text retrieval (TR) and text-to-image retrieval (IR). ALBEF is evaluated on the Flickr30K and COCO benchmarks, and the pre-trained model is fine-tuned on the training samples of each dataset. For zero-shot retrieval on Flickr30K, the model fine-tuned on COCO is evaluated. During fine-tuning, the ITC and ITM losses are jointly optimized: ITC learns an image-text scoring function based on unimodal similarity, while ITM models the fine-grained interaction between image and text to predict a matching score. Since each image in the downstream datasets is paired with multiple texts, the ground-truth labels of ITC are changed to account for multiple positives in the queue. During inference, the feature similarity score $s_{itc}$ is first computed for all image-text pairs; the top-k candidates are then taken and re-ranked by their ITM scores $s_{itm}$. Because $k$ can be set very small, inference is much faster; a sketch of this two-stage procedure follows.
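The sketch covers image-to-text retrieval: rank candidates by the unimodal similarity $s_{itc}$, then re-rank the top-k with the ITM head. The ITM scoring helper is an assumed stand-in for a forward pass through the multimodal encoder's ITM head.

```python
# Sketch of the two-stage retrieval inference described above; illustrative only.
import torch

@torch.no_grad()
def image_to_text_retrieval(image_feat, text_feats, itm_score_fn, k=128):
    """
    image_feat:   [D] normalized feature of the query image
    text_feats:   [N, D] normalized features of all candidate texts
    itm_score_fn: callable(text_idx: LongTensor) -> [k] matching scores from the
                  multimodal encoder's ITM head (assumed helper)
    """
    s_itc = text_feats @ image_feat          # [N] coarse similarity scores
    topk = s_itc.topk(k).indices             # keep only the top-k candidates
    s_itm = itm_score_fn(topk)               # fine-grained re-ranking scores
    order = s_itm.argsort(descending=True)
    return topk[order]                       # candidate indices, best first
```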

2. Visual Entailment

Visual entailment is a fine-grained visual reasoning task that predicts whether the relationship between an image and a text is entailment, neutral, or contradiction. Following UNITER, visual entailment is treated as a three-way classification problem, and an MLP on top of the multimodal encoder's [CLS] representation predicts the class probabilities.

3. Visual Question Answering (VQA)

Given an image and a question, VQA requires the model to predict an answer. Unlike existing methods that treat VQA as a multi-answer classification problem, the authors treat it as an answer generation problem. Specifically, a 6-layer Transformer decoder is used to generate answers. As shown in panel (a) of the figure above, the autoregressive answer decoder attends to the multimodal embeddings through cross-attention; a [CLS] token serves as the decoder's start-of-sequence input, and a [SEP] token appended to the output marks the end of generation. The answer decoder is initialized with the pre-trained weights of the multimodal encoder and fine-tuned with a conditional language-modeling loss. For a fair comparison with existing methods, the decoder is constrained during inference to generate only from the 3192 candidate answers. One way to realize this constraint is sketched below.
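A hedged sketch of constrained answer generation: score each candidate answer with the decoder's language-modeling loss and keep the best one. This ranking-by-decoder-score reading is one straightforward way to restrict generation to a fixed answer list; the decoder scoring helper and all names are assumptions, not the authors' code.

```python
# Sketch: rank a fixed list of candidate answers by the decoder's conditional likelihood.
import torch

@torch.no_grad()
def rank_candidate_answers(decoder_nll_fn, multimodal_states, candidate_answer_ids):
    """
    decoder_nll_fn:       callable(answer_ids, multimodal_states) -> scalar tensor with the
                          negative log-likelihood of one answer sequence (assumed helper)
    multimodal_states:    encoder outputs for the (image, question) pair
    candidate_answer_ids: list of token-id tensors, one per candidate answer
    """
    scores = torch.stack([
        -decoder_nll_fn(ans, multimodal_states) for ans in candidate_answer_ids
    ])
    best = scores.argmax().item()            # index of the highest-likelihood answer
    return best, scores
```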

4. Natural Language for Visual Reasoning (NLVR)

NLVR requires the model to determine whether a text describes a pair of images. The authors extend the multimodal encoder so that it can reason over two images. As shown in panel (b) of the figure above, each layer of the multimodal encoder is duplicated into two consecutive Transformer blocks, each containing a self-attention layer, a cross-attention layer, and a feed-forward layer. The two blocks in each layer are initialized with the same pre-trained weights, and their cross-attention layers share the same linear projection weights. During training, the two blocks receive the embedding sets of the two images, respectively. An MLP classifier on the multimodal encoder's [CLS] representation makes the prediction; a sketch of the duplicated layer follows.
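A sketch of one reading of this duplication, reusing the illustrative CrossModalLayer from the earlier architecture sketch: the text stream passes through two consecutive blocks, each cross-attending to one image of the pair.

```python
# Sketch of a duplicated multimodal layer for NLVR; illustrative, not the authors' code.
import copy
from torch import nn

class NLVRLayer(nn.Module):
    def __init__(self, pretrained_layer):
        super().__init__()
        # both blocks start from the same pre-trained multimodal layer
        self.block1 = copy.deepcopy(pretrained_layer)
        self.block2 = copy.deepcopy(pretrained_layer)

    def forward(self, text, image1, image2):
        text = self.block1(text, image1)   # cross-attend to the first image
        text = self.block2(text, image2)   # cross-attend to the second image
        return text
```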

For NLVR, an additional pre-training step is performed to prepare the new multimodal encoder for encoding image pairs. The authors design a text-assignment (TA) task: given a pair of images and a text, the model must assign the text to the first image, the second image, or neither. TA is formulated as a three-way classification problem, and an FC layer on the [CLS] representation predicts the assignment. TA pre-training is run for 1 epoch on the 4M images.

5. Visual Grounding

The goal of visual grounding is to localize the region in an image that corresponds to a given textual description. The authors study the weakly supervised setting, where no bounding-box annotations are available. Experiments are conducted on the RefCOCO+ dataset, and the model is fine-tuned with only image-text supervision, using the same strategy as for image-text retrieval. During inference, Grad-CAM is extended to obtain heatmaps, which are used to rank the detected proposals.

Five. Experiments

(Omitted in the original post.)
