[Natural Language Processing] [Multimodal] ALBEF: Vision-Language Representation Learning Based on Momentum Distillation
2022-06-11 23:08:00 【BQW_】
Paper: https://arxiv.org/pdf/2107.07651.pdf
Related posts:
[Natural Language Processing] [Multimodal] CLIP: Learning transferable visual models from natural language supervision
[Natural Language Processing] [Multimodal] ViT-BERT: Pre-training a unified foundation model on non-image-text-pair data
[Natural Language Processing] [Multimodal] BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation
[Natural Language Processing] [Multimodal] FLAVA: A foundational language and vision alignment model
[Natural Language Processing] [Multimodal] SimVLM: Simple visual language model pre-training with weak supervision
[Natural Language Processing] [Multimodal] UniT: Multimodal multitask learning with a unified Transformer
[Natural Language Processing] [Multimodal] Product1M: Weakly supervised instance-level product retrieval via cross-modal pre-training
[Natural Language Processing] [Multimodal] ALBEF: Vision-language representation learning with momentum distillation
I. Introduction
Vision-and-Language Pre-training (VLP) aims to learn multimodal representations from large-scale image-text pairs in order to improve downstream vision-and-language (V+L) tasks. Most existing VLP methods rely on a pre-trained object detector to extract region-based image features, and use a multimodal encoder to fuse the image features with word tokens. The multimodal encoder is trained on tasks that require jointly understanding images and text, such as masked language modeling (MLM) and image-text matching (ITM).
Although effective, these VLP frameworks suffer from several key limitations: (1) the image features and the word token embeddings live in their own spaces, which makes it challenging for the multimodal encoder to learn to model their interactions; (2) the object detector is both annotation-expensive and compute-expensive, because it requires manually labeled bounding boxes during pre-training and high-resolution images during inference; (3) the widely used image-text datasets are collected from the web and are inherently noisy, so pre-training objectives such as MLM may overfit to the noisy text and degrade the model's generalization.
The authors propose ALBEF (ALign BEfore Fuse), a new VLP framework that addresses these limitations. A detector-free image encoder and a text encoder first encode the image and the text independently. A multimodal encoder then fuses the image features with the text features through cross-modal attention. The authors introduce an intermediate image-text contrastive (ITC) loss on the representations of the unimodal encoders, which serves three purposes: (1) it aligns the image features and the text features, making cross-modal learning easier for the multimodal encoder; (2) it improves the unimodal encoders so that they better capture the semantics of images and texts; (3) it learns a common low-dimensional embedding space for images and texts, which allows the image-text matching objective to find more informative samples through contrastive hard-negative mining.
To improve learning under noisy supervision, the authors propose momentum distillation (MoD), a simple method that enables the model to exploit larger, noisier datasets. During training, a momentum version of the model is maintained by taking a moving average of the model parameters, and this momentum model is used to generate pseudo-targets as additional supervision. With MoD, the model is no longer penalized for producing reasonable outputs that differ from the web annotations. MoD improves not only pre-training but also downstream tasks.
The authors also provide a theoretical analysis of ALBEF from the perspective of maximizing mutual information. Specifically, ITC and MLM maximize a lower bound on the mutual information between different views of an image-text pair, where the views are generated by taking partial information from each pair. From this perspective, momentum distillation can be interpreted as generating new views with the same semantics. ALBEF therefore learns vision-language representations that are invariant to semantic-preserving transformations.
The authors demonstrate the effectiveness of ALBEF on a variety of downstream V+L tasks, including image-text retrieval, visual question answering, visual reasoning, visual entailment, and weakly supervised visual grounding. ALBEF achieves substantial improvements over existing state-of-the-art methods. On image-text retrieval it outperforms methods that are pre-trained on much larger datasets (CLIP and ALIGN). On VQA and NLVR2 it achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art method VILLA, while also enjoying faster inference. In addition, the authors use Grad-CAM to analyze ALBEF both qualitatively and quantitatively.
II. ALBEF Pre-training

1. Model architecture
As shown in the figure above, ALBEF consists of an image encoder, a text encoder, and a multimodal encoder. A 12-layer ViT-B/16 is used as the image encoder and is initialized with weights pre-trained on ImageNet-1K. An input image $I$ is encoded as a sequence of embeddings $\{\textbf{v}_{cls},\textbf{v}_1,\dots,\textbf{v}_N\}$, where $\textbf{v}_{cls}$ is the embedding of the [CLS] token. A 6-layer Transformer is used for both the text encoder and the multimodal encoder: the text encoder is initialized with the first 6 layers of $\text{BERT}_{base}$, and the multimodal encoder is initialized with the last 6 layers of $\text{BERT}_{base}$. The text encoder transforms the input text $T$ into a sequence of embeddings $\{\textbf{w}_{cls},\textbf{w}_1,\dots,\textbf{w}_N\}$, which is fed into the multimodal encoder. In each layer of the multimodal encoder, image features and text features are fused through cross attention.
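To make the layer split concrete, below is a minimal PyTorch-style sketch of the three encoders and of how text features attend to image features in every fusion layer. The module structure, hidden sizes, and the 256-dimensional projection heads are illustrative assumptions, not the authors' released implementation.

```python
# Minimal structural sketch of ALBEF's three encoders (assumed, simplified).
import torch
import torch.nn as nn

D = 768  # hidden size shared by ViT-B/16 and BERT-base

class MultimodalLayer(nn.Module):
    """One multimodal-encoder layer: self-attention over text,
    cross-attention from text queries to image features, then an FFN."""
    def __init__(self, d=D, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, text, image):
        text = self.norm1(text + self.self_attn(text, text, text)[0])
        text = self.norm2(text + self.cross_attn(text, image, image)[0])  # fuse with image features
        return self.norm3(text + self.ffn(text))

class ALBEFSketch(nn.Module):
    def __init__(self, image_encoder, text_encoder):
        super().__init__()
        self.visual = image_encoder            # e.g. a 12-layer ViT-B/16, ImageNet-1K init
        self.text = text_encoder               # e.g. the first 6 layers of BERT-base
        self.fusion = nn.ModuleList([MultimodalLayer() for _ in range(6)])  # last 6 BERT layers in the paper
        self.vision_proj = nn.Linear(D, 256)   # g_v: maps [CLS] to a normalized low-dim embedding
        self.text_proj = nn.Linear(D, 256)     # g_w

    def forward(self, image, text_ids):
        img_emb = self.visual(image)           # (B, 1+N, D), first token is v_cls
        txt_emb = self.text(text_ids)          # (B, 1+L, D), first token is w_cls
        fused = txt_emb
        for layer in self.fusion:
            fused = layer(fused, img_emb)      # cross-modal fusion at every layer
        return img_emb, txt_emb, fused
```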
2. Pre-training objectives
ALBEF is pre-trained with three objectives: image-text contrastive learning (ITC) on the unimodal encoders, masked language modeling (MLM) on the multimodal encoder, and image-text matching (ITM). In addition, ITM is improved with online contrastive hard-negative mining.
2.1 Image-Text Contrastive learning (ITC)
This loss aims to learn better unimodal representations before fusion. It learns a similarity function $s=g_v(\textbf{v}_{cls})^\top g_w(\textbf{w}_{cls})$ such that matched image-text pairs receive higher similarity scores, where $g_v$ and $g_w$ are linear transformations that map the [CLS] embeddings to normalized low-dimensional representations. Inspired by MoCo, two queues are maintained to store the $M$ most recent image-text representations from the momentum unimodal encoders. The normalized features from the momentum encoders are denoted $g_v'(\textbf{v}_{cls}')$ and $g_w'(\textbf{w}_{cls}')$. The similarities are defined as $s(I,T)=g_v(\textbf{v}_{cls})^\top g_w'(\textbf{w}_{cls}')$ and $s(T,I)=g_w(\textbf{w}_{cls})^\top g_v'(\textbf{v}_{cls}')$.
For each image and text, the softmax-normalized image-to-text and text-to-image similarities are computed as:
$$p_m^{i2t}(I)=\frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M \exp(s(I,T_m)/\tau)},\quad p_m^{t2i}(T)=\frac{\exp(s(T,I_m)/\tau)}{\sum_{m=1}^M \exp(s(T,I_m)/\tau)} \tag{1}$$
where $\tau$ is a learnable temperature parameter. Let $\textbf{y}^{i2t}(I)$ and $\textbf{y}^{t2i}(T)$ denote the ground-truth one-hot similarities, in which negative pairs have probability 0 and the positive pair has probability 1. The image-text contrastive loss is defined as the cross entropy $H$ between $\textbf{p}$ and $\textbf{y}$:
$$\mathcal{L}_{itc}=\frac{1}{2}\mathbb{E}_{(I,T)\sim D}\big[H(\textbf{y}^{i2t}(I),\textbf{p}^{i2t}(I))+H(\textbf{y}^{t2i}(T),\textbf{p}^{t2i}(T))\big] \tag{2}$$
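A compact sketch of how this loss could be computed with the momentum queues is shown below; the tensor shapes, the queue handling, and the use of `F.cross_entropy` against the positive index are simplifying assumptions.

```python
# Sketch of the ITC loss with momentum queues (Eq. 1-2), assumed shapes.
import torch
import torch.nn.functional as F

def itc_loss(img_feat, txt_feat, img_feat_m, txt_feat_m,
             img_queue, txt_queue, temp):
    """img_feat/txt_feat: (B, d) normalized [CLS] projections of the base model.
    img_feat_m/txt_feat_m: (B, d) from the momentum encoders.
    img_queue/txt_queue: (M, d) most recent momentum features."""
    # candidates = current momentum batch + queue
    txt_all = torch.cat([txt_feat_m, txt_queue], dim=0)   # (B+M, d)
    img_all = torch.cat([img_feat_m, img_queue], dim=0)

    sim_i2t = img_feat @ txt_all.t() / temp                # s(I, T_m) for all candidates
    sim_t2i = txt_feat @ img_all.t() / temp                # s(T, I_m)

    # one-hot targets: the matching pair of sample i sits at column i
    targets = torch.arange(img_feat.size(0), device=img_feat.device)
    loss = (F.cross_entropy(sim_i2t, targets) +
            F.cross_entropy(sim_t2i, targets)) / 2
    return loss
```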
2.2 Masked Language Modeling (MLM)
MLM uses both the image and the contextual text to predict masked words. Input tokens are randomly masked with probability 15% and replaced by the special [MASK] token. Let $\hat{T}$ denote the masked text and $\textbf{p}^{msk}(I,\hat{T})$ denote the model's predicted probability for a masked token. MLM minimizes the cross-entropy loss:
$$\mathcal{L}_{mlm}=\mathbb{E}_{(I,\hat{T})\sim D}\, H(\textbf{y}^{msk},\textbf{p}^{msk}(I,\hat{T})) \tag{3}$$
where $\textbf{y}^{msk}$ is a one-hot vocabulary distribution in which the ground-truth token has probability 1.
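Below is a minimal sketch of the 15% random masking step described above; real tokenizers additionally skip padding and special tokens, and the handling here is an illustrative assumption.

```python
# Sketch of 15% random masking for MLM (simplified, assumed behavior).
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """input_ids: (B, L) integer token ids."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~masked] = -100              # unmasked positions are ignored by cross-entropy
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id   # replace masked positions with [MASK]
    return corrupted, labels
```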
2.3 Image-Text Matching (ITM)
ITM predicts whether an image-text pair is matched or not. The [CLS] output embedding of the multimodal encoder is used as the joint representation of the image-text pair, and a fully connected layer followed by a softmax predicts the two-class probability $p^{itm}$. The ITM loss is:
$$\mathcal{L}_{itm}=\mathbb{E}_{(I,T)\sim D}\, H(\textbf{y}^{itm},\textbf{p}^{itm}(I,T)) \tag{4}$$
where $\textbf{y}^{itm}$ is a two-dimensional one-hot vector.
In addition, the authors propose a hard-negative sampling strategy for the ITM task. An image-text pair is considered a hard negative if it shares similar semantics but differs in fine-grained details. The contrastive similarity from Equation (1) is used to find in-batch hard negatives, as sketched below. For each image in a batch, one negative text is sampled from the same batch according to the contrastive similarity distribution, so that texts more similar to the image are more likely to be sampled. Likewise, one hard negative image is sampled for each text.
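A possible implementation of this in-batch sampling, assuming the (B, B) similarity matrices of Equation (1) are already available:

```python
# Sketch of in-batch hard-negative sampling for ITM: negatives are drawn with
# probability proportional to the contrastive similarity (assumed, simplified).
import torch

def sample_hard_negatives(sim_i2t, sim_t2i, temp):
    """sim_i2t, sim_t2i: (B, B) similarity matrices within the batch."""
    weights_i2t = torch.softmax(sim_i2t / temp, dim=1)
    weights_t2i = torch.softmax(sim_t2i / temp, dim=1)
    # a pair cannot be its own negative
    weights_i2t.fill_diagonal_(0)
    weights_t2i.fill_diagonal_(0)
    neg_txt_idx = torch.multinomial(weights_i2t, 1).squeeze(1)  # hard negative text per image
    neg_img_idx = torch.multinomial(weights_t2i, 1).squeeze(1)  # hard negative image per text
    return neg_txt_idx, neg_img_idx
```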
The full pre-training objective of ALBEF is:
$$\mathcal{L}=\mathcal{L}_{itc}+\mathcal{L}_{mlm}+\mathcal{L}_{itm} \tag{5}$$
3. Momentum distillation
The image-text pairs used for pre-training are mostly collected from the web and are noisy. Positive pairs are often only weakly correlated: the text may contain words that are unrelated to the image, or the image may contain entities that are not described in the text. For ITC learning, a negative text may actually match the content of an image. For MLM, there may exist words other than the annotation that describe the image equally well. However, the one-hot labels of ITC and MLM penalize all negative predictions, regardless of their correctness.
To address this, the authors propose learning from pseudo-targets generated by a momentum model. The momentum model is a continuously evolving teacher that consists of exponential-moving-average versions of the unimodal and multimodal encoders. During training, the base model is trained so that its predictions match those of the momentum model. Specifically, for ITC, the features of the momentum unimodal encoders are first used to compute the image-text similarities $s'(I,T)=g_v'(\textbf{v}_{cls}')^\top g_w'(\textbf{w}_{cls}')$ and $s'(T,I)=g_w'(\textbf{w}_{cls}')^\top g_v'(\textbf{v}_{cls}')$. Pseudo-targets $\textbf{q}^{i2t}$ and $\textbf{q}^{t2i}$ are then computed by replacing $s$ with $s'$ in Equation (1). The $\text{ITC}_{MoD}$ loss is defined as:
$$\mathcal{L}_{itc}^{mod}=(1-\alpha)\mathcal{L}_{itc}+\frac{\alpha}{2}\mathbb{E}_{(I,T)\sim D}\big[\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))+\text{KL}(\textbf{q}^{t2i}(T)\parallel\textbf{p}^{t2i}(T))\big] \tag{6}$$
Similarly, for MLM, let $\textbf{q}^{msk}(I,\hat{T})$ denote the momentum model's predicted probability for a masked token. The $\text{MLM}_{MoD}$ loss is:
$$\mathcal{L}_{mlm}^{mod}=(1-\alpha)\mathcal{L}_{mlm}+\alpha\,\mathbb{E}_{(I,\hat{T})\sim D}\,\text{KL}(\textbf{q}^{msk}(I,\hat{T})\parallel\textbf{p}^{msk}(I,\hat{T})) \tag{7}$$
The figure above shows top-5 candidates from the pseudo-targets, which effectively capture words and texts relevant to an image.
The authors also apply MoD to downstream tasks. The final loss for each task is a weighted combination of the original task loss and the KL divergence between the model's prediction and the pseudo-targets. For simplicity, the weight is set to $\alpha=0.4$ for all pre-training and downstream tasks.
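A rough sketch of the two ingredients of MoD, the EMA parameter update and the soft ITC target of Equation (6), is given below; the tensor shapes and the reduction to a cross-entropy-style term (which equals the KL divergence up to a constant independent of the base model) are assumptions made for illustration.

```python
# Sketch of momentum (EMA) parameter updates and the soft ITC target of Eq. 6.
# The 0.995 momentum and the alpha weighting follow the numbers quoted in the text.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def itc_mod_loss(sim_i2t, sim_i2t_m, targets, temp, alpha=0.4):
    """sim_i2t: (B, B+M) base-model similarities; sim_i2t_m: momentum-model
    similarities over the same candidates; targets: index of the positive."""
    log_p = F.log_softmax(sim_i2t / temp, dim=1)
    hard = F.nll_loss(log_p, targets)                 # one-hot (cross-entropy) term
    q = F.softmax(sim_i2t_m / temp, dim=1)            # pseudo-targets from the momentum model
    soft = -(q * log_p).sum(dim=1).mean()             # KL(q || p) up to a constant
    return (1 - alpha) * hard + alpha * soft
```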
4. Pre-training datasets
Following UNITER, two web datasets (Conceptual Captions, SBU Captions) and two in-domain datasets (COCO, Visual Genome) are used. The total number of unique images is 4M, and the number of image-text pairs is 5.1M. To demonstrate the scalability of the method on large-scale web data, the authors also include the much noisier Conceptual 12M dataset, which increases the total number of images to 14.1M.
5. Implementation details
The model consists of a $\text{BERT}_{base}$ with 123.7M parameters and a ViT-B/16 with 85.8M parameters. It is pre-trained for 30 epochs on 8 NVIDIA A100 GPUs with a batch size of 512, using the AdamW optimizer with a weight decay of 0.02. The learning rate is warmed up to $1e^{-4}$ during the first 1000 iterations and then decayed to $1e^{-5}$ following a cosine schedule. During pre-training, random image crops of resolution $256\times 256$ are used as input and RandAugment is applied. During fine-tuning, the image resolution is increased to $384\times 384$ and the positional encodings of the image patches are interpolated (a sketch of this interpolation follows below). The momentum parameter for updating the momentum model is set to 0.995, and the queue size for image-text contrastive learning is set to 65,536. The distillation weight $\alpha$ is linearly ramped up from 0 to 0.4 within the first epoch.
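The positional-encoding interpolation for the resolution change from 256x256 to 384x384 could look roughly like the following; the (1, 1 + grid*grid, D) layout with the [CLS] embedding first is an assumption.

```python
# Sketch of interpolating ViT positional embeddings when the fine-tuning
# resolution grows from 256x256 to 384x384 (grid 16x16 -> 24x24 for patch size 16).
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=16, new_grid=24):
    """pos_embed: (1, 1 + old_grid*old_grid, D), with the [CLS] position first."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pos.size(-1)
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode='bicubic', align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)
```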
III. From the Perspective of Maximizing Mutual Information
This section provides an alternative perspective on ALBEF and shows that it maximizes a lower bound on the mutual information between different views of an image-text pair. ITC, MLM, and MoD can all be interpreted as different ways of generating views.
Formally, define two random variables $a$ and $b$ as two different views of a data point. In self-supervised learning, $a$ and $b$ are two augmentations of the same image. In vision-language representation learning, $a$ and $b$ are considered to be different variants of an image-text pair that capture the same semantics. The goal is to learn representations that are invariant to the change of view, which can be achieved by maximizing the mutual information between $a$ and $b$. In practice, a lower bound on $\text{MI}(a,b)$ is maximized by minimizing the InfoNCE loss:
$$\mathcal{L}_{NCE}=-\mathbb{E}_{p(a,b)}\Bigg[\log\frac{\exp(s(a,b))}{\sum_{\hat{b}\in\hat{B}}\exp(s(a,\hat{b}))}\Bigg] \tag{8}$$
where $s(a,b)$ is a scoring function, and $\hat{B}$ contains the positive sample $b$ and $|\hat{B}|-1$ negative samples.
The ITC loss in this paper can be rewritten as:
$$\mathcal{L}_{itc}=-\frac{1}{2}\mathbb{E}_{p(I,T)}\Big[\log\frac{\exp(s(I,T)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)}+\log\frac{\exp(s(T,I)/\tau)}{\sum_{m=1}^M\exp(s(T,I_m)/\tau)} \Big] \tag{9}$$
Minimizing $\mathcal{L}_{itc}$ can thus be seen as maximizing a symmetric version of InfoNCE. ITC therefore treats the two modalities of an image-text pair as two views, and trains the unimodal encoders to maximize the MI between the image view and the text view.
MLM can also be interpreted as maximizing the mutual information between a masked word token and its masked context. Specifically, the MLM loss can be rewritten as:
$$\mathcal{L}_{mlm}=-\mathbb{E}_{p(I, \hat{T})}\Big[\log\frac{\exp(\psi(y^{msk})^\top f(I,\hat{T}))}{\sum_{y\in\mathcal{V}}\exp(\psi(y)^\top f(I,\hat{T}))}\Big] \tag{10}$$
where $\psi(y):\mathcal{V}\rightarrow \mathbb{R}^d$ is a lookup function in the output layer of the multimodal encoder that maps a word token $y$ to a vector, $\mathcal{V}$ is the full vocabulary, and $f(I,\hat{T})$ returns the final hidden state of the multimodal encoder corresponding to the masked context. MLM therefore treats an image-text pair as two views: (1) a randomly chosen word token; (2) the image plus the masked context of that word.
Both ITC and MLM generate views by taking partial information from an image-text pair. The momentum distillation proposed in this paper can be viewed as generating alternative views from the entire distribution. Taking the $\text{ITC}_{MoD}$ of Equation (6) as an example, minimizing $\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))$ is equivalent to minimizing the following objective:
$$-\sum_{m} q_m^{i2t}(I)\log p_m^{i2t}(I)=-\sum_m \frac{\exp(s'(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s'(I,T_m)/\tau)}\log \frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)} \tag{11}$$
This maximizes the mutual information $\text{MI}(I,T_m)$ between the image $I$ and the texts $T_m$ that share similar semantics with it, because such texts receive larger $q^{i2t}_m(I)$. Similarly, $\text{ITC}_{MoD}$ also maximizes $\text{MI}(I_m,T)$ for images that are similar to the text $T$. Following the same argument, $\text{MLM}_{MoD}$ generates alternative views $y'\in\mathcal{V}$ for a masked word $y^{msk}$ and maximizes the MI between $y'$ and $(I,\hat{T})$. Momentum distillation can therefore be viewed as performing data augmentation on the original views: the momentum model generates views that differ from the original image-text pair, and encourages the base model to learn representations that capture view-invariant semantic information.
IV. Downstream V+L Tasks

The pre-trained model is adapted to five downstream V+L tasks. Each task and its fine-tuning strategy are described below.
1. Image-Text Retrieval
Image-text retrieval contains two subtasks: image-to-text retrieval (TR) and text-to-image retrieval (IR). ALBEF is evaluated on the Flickr30K and COCO benchmarks, fine-tuning the pre-trained model on the training samples of each dataset. For zero-shot retrieval on Flickr30K, the model fine-tuned on COCO is evaluated. During fine-tuning, the ITC and ITM losses are jointly optimized: ITC learns an image-text scoring function based on unimodal similarity, while ITM models the fine-grained interactions between image and text to predict a matching score. Since each image in the downstream datasets is paired with multiple texts, the ground-truth labels of ITC are changed to account for multiple positives in the queue. During inference, the unimodal feature similarity score $s_{itc}$ is first computed for all image-text pairs; the top-k candidates are then taken and their ITM scores $s_{itm}$ are computed for re-ranking (see the sketch below). Because $k$ can be set very small, inference is much faster than scoring every pair with the multimodal encoder.
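A sketch of this two-stage inference is given below; the gallery layout and the `itm_score_fn` placeholder are assumptions.

```python
# Sketch of two-stage retrieval inference: rank all texts by the cheap
# contrastive score s_itc, then re-rank only the top-k with the ITM head.
import torch

def retrieve_texts(img_feat, txt_feats, itm_score_fn, k=128):
    """img_feat: (d,) normalized image [CLS] projection;
    txt_feats: (N, d) normalized text projections for the whole gallery;
    itm_score_fn(text_index) -> matching score from the multimodal encoder."""
    s_itc = txt_feats @ img_feat                          # (N,) single-modality similarity
    topk = torch.topk(s_itc, k).indices                   # cheap candidate selection
    s_itm = torch.stack([itm_score_fn(i) for i in topk])  # expensive re-scoring of k candidates
    order = torch.argsort(s_itm, descending=True)
    return topk[order]                                    # final ranking of the k candidates
```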
2. Visual Entailment
Visual entailment is a fine-grained visual reasoning task that predicts whether the relationship between an image and a text is entailment, neutral, or contradiction. Following UNITER, visual entailment is treated as a three-way classification problem, and an MLP on top of the multimodal encoder's [CLS] representation predicts the class probabilities.
3. Visual Question Answering (VQA)
Given an image and a question, VQA requires the model to predict an answer. Unlike existing methods that treat VQA as a multi-answer classification problem, the authors treat VQA as an answer-generation problem. Specifically, a 6-layer Transformer decoder is used to generate answers. As shown in figure (a) above, the auto-regressive answer decoder receives the multimodal embeddings through cross attention, with a [CLS] token as the decoder's initial input; an [SEP] token appended to the decoder's output signals the end of generation. The answer decoder is initialized with the pre-trained weights of the multimodal encoder and fine-tuned with a conditional language-modeling loss. For fair comparison with existing methods, the decoder is constrained during inference to generate only from the 3,192 candidate answers (a scoring sketch follows below).
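One simple way to realize this constraint is to score every candidate answer with the decoder and keep the most likely one, as in the following sketch; the `decoder_logprob_fn` interface is a hypothetical placeholder.

```python
# Sketch of constrained answer decoding: score each candidate answer with the
# autoregressive decoder and keep the most likely one (assumed interface).
import torch

def rank_candidate_answers(decoder_logprob_fn, fused_states, candidates):
    """decoder_logprob_fn(fused_states, answer_token_ids) -> summed log-prob of the
    answer sequence under the decoder; candidates: list of token-id tensors for
    the ~3,192 allowed answers."""
    scores = torch.stack([decoder_logprob_fn(fused_states, ans) for ans in candidates])
    best = torch.argmax(scores)   # index of the highest-likelihood answer
    return best, scores
```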
4. Natural Language for Visual Reasoning (NLVR2)
NLVR2 requires the model to determine whether a text describes a pair of images. The authors extend the multimodal encoder so that it can reason over two images. As shown in figure (b) above, each layer of the multimodal encoder is duplicated into two consecutive Transformer blocks, where each block contains a self-attention layer, a cross-attention layer, and a feed-forward layer. The two blocks in each layer are initialized with the same pre-trained weights, and the two cross-attention layers share the same linear projection weights. During training, the two blocks receive the two sets of embeddings from the image pair. An MLP classifier appended to the multimodal encoder's [CLS] representation makes the prediction.
For NLVR2, an additional pre-training step is performed to prepare the new multimodal encoder for encoding image pairs. The authors design a text-assignment (TA) task: given a pair of images and a text, the model must assign the text to the first image, the second image, or neither. This is defined as a three-way classification problem, and an FC layer on the [CLS] representation predicts the assignment. Pre-training with TA is performed for 1 epoch on the 4M images.
5. Visual Grounding
The goal of visual grounding is to localize the region in an image that corresponds to a specific textual description. The authors study the weakly supervised setting, where no bounding-box annotations are available. Experiments are conducted on the RefCOCO+ dataset, and the model is fine-tuned with only image-text supervision, using the same strategy as image-text retrieval. During inference, Grad-CAM is extended to obtain heat maps, which are used to rank the detected object proposals (a sketch follows below).
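A rough sketch of Grad-CAM applied to a cross-attention map is shown below; the choice of layer, the head averaging, and the normalization are assumptions rather than the exact recipe in the paper.

```python
# Sketch of Grad-CAM on a cross-attention map for weakly supervised grounding:
# weight the attention by the gradient of the ITM matching score, keep the
# positive part, and treat the result as a patch-level heat map (assumed details).
import torch

def gradcam_heatmap(cross_attn, itm_score):
    """cross_attn: (heads, num_text_tokens, num_patches) attention map that is part
    of the graph producing itm_score; itm_score: scalar matching score from the ITM head."""
    grads = torch.autograd.grad(itm_score, cross_attn, retain_graph=True)[0]
    cam = (cross_attn * grads.clamp(min=0)).mean(dim=0)   # gradient-weighted attention, averaged over heads
    return cam.mean(dim=0)                                 # (num_patches,) heat map over image patches
```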
V. Experiments
Omitted.