[Natural Language Processing] [Multimodal] ALBEF: Vision-Language Representation Learning with Momentum Distillation
2022-06-11 23:08:00 【BQW_】
Paper: https://arxiv.org/pdf/2107.07651.pdf
Related posts:
[Natural Language Processing] [Multimodal] CLIP: Learning Transferable Visual Models from Natural Language Supervision
[Natural Language Processing] [Multimodal] ViT-BERT: A unified foundation model pre-trained on non-image-text-pair data
[Natural Language Processing] [Multimodal] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[Natural Language Processing] [Multimodal] FLAVA: A foundational language and vision alignment model
[Natural Language Processing] [Multimodal] SimVLM: Simple Visual Language Model Pre-training with Weak Supervision
[Natural Language Processing] [Multimodal] UniT: Multimodal Multitask Learning with a Unified Transformer
[Natural Language Processing] [Multimodal] Product1M: Weakly-supervised instance-level product retrieval via cross-modal pre-training
[Natural Language Processing] [Multimodal] ALBEF: Vision-Language Representation Learning with Momentum Distillation
I. Introduction
Vision-and-Language Pre-training (VLP) aims to learn multimodal representations from large-scale image-text pairs in order to improve downstream vision-and-language (V+L) tasks. Most existing VLP methods rely on a pre-trained object detector to extract region-based image features, and use a multimodal encoder to fuse the image features with word features. The multimodal encoder is trained on tasks that require jointly understanding the image and the text, such as masked language modeling (MLM) and image-text matching (ITM).
Although effective, these VLP frameworks suffer from several key limitations: (1) the image features and the word embeddings live in their own spaces, which makes it challenging for the multimodal encoder to learn to model their interactions; (2) the object detector is expensive both to annotate and to compute, because it requires manually labeled bounding boxes during pre-training and high-resolution images during inference; (3) the widely used image-text datasets are collected from the web and are inherently noisy, so existing pre-training objectives such as MLM may overfit the noisy text and degrade the model's generalization.
The authors propose ALBEF (ALign BEfore Fuse), a new VLP framework that addresses these limitations. A detector-free image encoder and a text encoder first encode the image and the text independently. A multimodal encoder then fuses the image features with the text features through cross-modal attention. The authors introduce an intermediate image-text contrastive (ITC) loss on the representations of the unimodal encoders, which serves three purposes: (1) it aligns the image features and the text features, making cross-modal learning easier for the multimodal encoder; (2) it improves the unimodal encoders so that they better capture the semantics of images and texts; (3) it learns a common low-dimensional embedding space for images and texts, which enables mining of more informative samples through contrastive hard negative mining.
To improve learning under noisy supervision, the authors propose momentum distillation (MoD), a simple method that enables the model to exploit larger noisy datasets. During training, a momentum version of the model is maintained by taking the moving average of the model parameters, and this momentum model is used to generate pseudo-targets as additional supervision. With MoD, the model is not penalized for producing reasonable outputs that differ from the web annotations. MoD improves not only pre-training but also downstream tasks.
The authors also provide a theoretical analysis of ALBEF from the perspective of mutual information maximization. Specifically, ITC and MLM maximize a lower bound on the mutual information between different views of an image-text pair, where the views are generated by taking partial information from each pair. From this perspective, momentum distillation can be interpreted as generating new views with the same semantics. Therefore, ALBEF learns vision-language representations that are invariant to semantics-preserving transformations.
The authors demonstrate the effectiveness of ALBEF on a variety of downstream V+L tasks, including image-text retrieval, visual question answering, visual reasoning, visual entailment, and weakly-supervised visual grounding. ALBEF achieves substantial improvements over existing state-of-the-art methods. On image-text retrieval, it outperforms methods that are pre-trained on much larger datasets (CLIP and ALIGN). On VQA and NLVR2, it improves over the state-of-the-art method VILLA by 2.37% and 3.84% respectively, while enjoying faster inference. In addition, the authors use Grad-CAM to analyze ALBEF both qualitatively and quantitatively.
II. ALBEF Pre-training

1. Model architecture
As shown in the figure above, ALBEF consists of an image encoder, a text encoder, and a multimodal encoder. A 12-layer ViT-B/16 is used as the image encoder, initialized with weights pre-trained on ImageNet-1K. An input image $I$ is encoded into a sequence of embeddings $\{\textbf{v}_{cls},\textbf{v}_1,\dots,\textbf{v}_N\}$, where $\textbf{v}_{cls}$ is the embedding of the [CLS] token. Both the text encoder and the multimodal encoder are 6-layer Transformers: the text encoder is initialized with the first 6 layers of $\text{BERT}_{base}$, and the multimodal encoder is initialized with its last 6 layers. The text encoder converts the input text $T$ into a sequence of embeddings $\{\textbf{w}_{cls},\textbf{w}_1,\dots,\textbf{w}_N\}$, which is fed to the multimodal encoder. In each layer of the multimodal encoder, the image features and the text features are fused through cross attention.
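To make the layer split concrete, below is a minimal structural sketch (not the released ALBEF code) of the three encoders and how text features are fused with image features. The use of `timm` and `transformers`, the class names, and the simplified `FusionLayer` are illustrative assumptions; in the paper the multimodal layers are initialized from the last 6 BERT layers, which is omitted here.

```python
import torch
import torch.nn as nn
import timm
from transformers import BertModel


class FusionLayer(nn.Module):
    """One multimodal-encoder layer: text self-attention, cross-attention to image patches, feed-forward."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, txt, img):
        txt = self.norm1(txt + self.self_attn(txt, txt, txt)[0])
        txt = self.norm2(txt + self.cross_attn(txt, img, img)[0])  # fuse image features into text
        return self.norm3(txt + self.ffn(txt))


class ALBEFLikeBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # 12-layer ViT-B/16 image encoder, ImageNet-1k pre-trained
        self.visual_encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        bert = BertModel.from_pretrained("bert-base-uncased")
        self.embeddings = bert.embeddings
        self.text_layers = bert.encoder.layer[:6]                              # text encoder = first 6 BERT layers
        self.fusion_layers = nn.ModuleList(FusionLayer() for _ in range(6))    # multimodal encoder (6 layers)

    def forward(self, image, input_ids):
        img_emb = self.visual_encoder.forward_features(image)   # (B, 1 + num_patches, 768)
        txt = self.embeddings(input_ids)
        for layer in self.text_layers:
            txt = layer(txt)[0]                                  # unimodal text encoding
        for layer in self.fusion_layers:
            txt = layer(txt, img_emb)                            # cross-modal fusion
        return txt                                               # txt[:, 0] is the joint [CLS] representation
```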
2. Pre-training objectives
ALBEF is pre-trained with three objectives: image-text contrastive learning (ITC) on the unimodal encoders, masked language modeling (MLM) on the multimodal encoder, and image-text matching (ITM). In addition, ITM is improved with online contrastive hard negative mining.
2.1 Image-Text Contrastive Learning (ITC)
This loss aims to learn better unimodal representations before fusion. It learns a similarity function $s=g_v(\textbf{v}_{cls})^\top g_w(\textbf{w}_{cls})$ such that matched image-text pairs have higher similarity scores, where $g_v$ and $g_w$ are linear projections that map the [CLS] embeddings to normalized low-dimensional representations. Inspired by MoCo, two queues are maintained to store the most recent $M$ image-text representations from the momentum unimodal encoders. The normalized features from the momentum encoders are denoted $g_v'(\textbf{v}_{cls}')$ and $g_w'(\textbf{w}_{cls}')$. The similarities are defined as $s(I,T)=g_v(\textbf{v}_{cls})^\top g_w'(\textbf{w}_{cls}')$ and $s(T,I)=g_w(\textbf{w}_{cls})^\top g_v'(\textbf{v}_{cls}')$.
For each image and each text, the softmax-normalized image-to-text and text-to-image similarities are computed as:
$$p_m^{i2t}(I)=\frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M \exp(s(I,T_m)/\tau)},\quad p_m^{t2i}(T)=\frac{\exp(s(T,I_m)/\tau)}{\sum_{m=1}^M \exp(s(T,I_m)/\tau)} \tag{1}$$
where $\tau$ is a learnable temperature parameter. Let $\textbf{y}^{i2t}(I)$ and $\textbf{y}^{t2i}(T)$ denote the ground-truth one-hot similarities, in which negative pairs have probability 0 and the positive pair has probability 1. The image-text contrastive loss is defined as the cross entropy $H$ between $\textbf{p}$ and $\textbf{y}$:
$$\mathcal{L}_{itc}=\frac{1}{2}\mathbb{E}_{(I,T)\sim D}\big[H(\textbf{y}^{i2t}(I),\textbf{p}^{i2t}(I))+H(\textbf{y}^{t2i}(T),\textbf{p}^{t2i}(T))\big] \tag{2}$$
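A simplified sketch of how Equations (1)-(2) can be computed with the momentum queues is given below. It assumes `img_feat`/`txt_feat` are the normalized [CLS] projections from the base encoders, `img_feat_m`/`txt_feat_m` come from the momentum encoders, and `img_queue`/`txt_queue` (shape `d x queue_size`) hold recent momentum features; names are illustrative, not from the official implementation.

```python
import torch
import torch.nn.functional as F


def itc_loss(img_feat, txt_feat, img_feat_m, txt_feat_m, img_queue, txt_queue, temp):
    """Image-text contrastive loss over the current batch plus the momentum queues (Eq. 1-2)."""
    # Candidate sets: momentum features of the current batch first, then the queue
    txt_all = torch.cat([txt_feat_m.t(), txt_queue], dim=1)   # (d, M)
    img_all = torch.cat([img_feat_m.t(), img_queue], dim=1)   # (d, M)

    sim_i2t = img_feat @ txt_all / temp                        # (B, M) image-to-text similarities
    sim_t2i = txt_feat @ img_all / temp                        # (B, M) text-to-image similarities

    # One-hot targets: the positive for sample i sits in column i (batch part comes first)
    targets = torch.arange(img_feat.size(0), device=img_feat.device)

    loss = (F.cross_entropy(sim_i2t, targets) + F.cross_entropy(sim_t2i, targets)) / 2
    return loss, sim_i2t, sim_t2i
```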
2.2 Masked Language Modeling (MLM)
MLM uses both the image and the contextual text to predict the masked words. Input tokens are randomly masked with probability 15% and replaced with the special [MASK] token. Let $\hat{T}$ denote the masked text and $\textbf{p}^{msk}(I,\hat{T})$ the model's predicted probability for a masked token. MLM minimizes the cross-entropy loss:
$$\mathcal{L}_{mlm}=\mathbb{E}_{(I,\hat{T})\sim D}\, H(\textbf{y}^{msk},\textbf{p}^{msk}(I,\hat{T})) \tag{3}$$
where $\textbf{y}^{msk}$ is a one-hot vocabulary distribution in which the ground-truth token has probability 1.
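For concreteness, a minimal sketch of the 15% random masking is shown below; the mask-token id and the ignore index follow common BERT conventions and are assumptions (real implementations also avoid masking special tokens).

```python
import torch


def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Randomly mask tokens for MLM; the loss is computed only on the masked positions."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~mask] = ignore_index          # unmasked positions are ignored by the loss
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id      # replace masked positions with [MASK]
    return masked_ids, labels
```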
2.3 Image-Text Matching (ITM)
ITM predicts whether an image-text pair is matched or not. The [CLS] output embedding of the multimodal encoder is used as the joint representation of the image-text pair, and a fully-connected layer followed by a softmax predicts the two-class probability $p^{itm}$. The ITM loss is:
$$\mathcal{L}_{itm}=\mathbb{E}_{(I,T)\sim D}\, H(\textbf{y}^{itm},\textbf{p}^{itm}(I,T)) \tag{4}$$
where $\textbf{y}^{itm}$ is a 2-dimensional one-hot vector representing the ground-truth label.
In addition, the authors propose a hard negative sampling strategy for the ITM task. An image-text pair is considered a hard negative if it shares similar semantics but differs in fine-grained details. The contrastive similarities from Equation (1) are used to find in-batch hard negatives. For each image in a batch, one negative text is sampled from the same batch according to the contrastive similarity distribution, so texts that are more similar to the image are more likely to be sampled. Likewise, one hard negative image is sampled for each text.
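A sketch of this in-batch sampling is given below, assuming `sim_i2t` and `sim_t2i` are the batch-by-batch similarity matrices (the in-batch part of Equation (1)); harder negatives get larger sampling weights, and the diagonal (the positive pair) is zeroed out.

```python
import torch
import torch.nn.functional as F


def sample_hard_negatives(sim_i2t, sim_t2i):
    """Sample one hard negative text per image and one hard negative image per text."""
    with torch.no_grad():
        weights_i2t = F.softmax(sim_i2t, dim=1)   # higher similarity -> higher sampling probability
        weights_t2i = F.softmax(sim_t2i, dim=1)
        weights_i2t.fill_diagonal_(0)             # never sample the positive pair itself
        weights_t2i.fill_diagonal_(0)
    neg_txt_idx = torch.multinomial(weights_i2t, 1).squeeze(1)   # hard negative text index per image
    neg_img_idx = torch.multinomial(weights_t2i, 1).squeeze(1)   # hard negative image index per text
    return neg_txt_idx, neg_img_idx
```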
The full pre-training objective of ALBEF is:
$$\mathcal{L}=\mathcal{L}_{itc}+\mathcal{L}_{mlm}+\mathcal{L}_{itm} \tag{5}$$
3. Momentum distillation
The image-text pairs used for pre-training are mostly collected from the web and are noisy. Positive pairs are often only weakly correlated: the text may contain words unrelated to the image, or the image may contain entities not described in the text. For ITC learning, a negative text may actually match the content of an image. For MLM, there may exist other words that differ from the annotation yet describe the image equally well. However, the one-hot labels of ITC and MLM penalize all negative predictions, regardless of their correctness.
To address this, the authors propose to learn from pseudo-targets generated by a momentum model. The momentum model is a continuously evolving teacher consisting of exponential-moving-average versions of the unimodal and multimodal encoders. During training, the base model is trained so that its predictions match those of the momentum model. Specifically, for ITC, the image-text similarities are first computed with features from the momentum unimodal encoders: $s'(I,T)=g_v'(\textbf{v}_{cls}')^\top g_w'(\textbf{w}_{cls}')$ and $s'(T,I)=g_w'(\textbf{w}_{cls}')^\top g_v'(\textbf{v}_{cls}')$. Then, pseudo-targets $\textbf{q}^{i2t}$ and $\textbf{q}^{t2i}$ are computed by replacing $s$ with $s'$ in Equation (1). The $\text{ITC}_{MoD}$ loss is defined as:
$$\mathcal{L}_{itc}^{mod}=(1-\alpha)\mathcal{L}_{itc}+\frac{\alpha}{2}\mathbb{E}_{(I,T)\sim D}\big[\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))+\text{KL}(\textbf{q}^{t2i}(T)\parallel\textbf{p}^{t2i}(T))\big] \tag{6}$$
Similarly, for MLM, let $\textbf{q}^{msk}(I,\hat{T})$ denote the momentum model's predicted probability for the masked token. The $\text{MLM}_{MoD}$ loss is:
$$\mathcal{L}_{mlm}^{mod}=(1-\alpha)\mathcal{L}_{mlm}+\alpha\,\mathbb{E}_{(I,\hat{T})\sim D}\,\text{KL}(\textbf{q}^{msk}(I,\hat{T})\parallel\textbf{p}^{msk}(I,\hat{T})) \tag{7}$$
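The two ingredients, the EMA update of the momentum model and the mixing of hard and soft targets, can be sketched as follows (illustrative helper functions; `model` and `model_m` are assumed to have identical parameter layouts):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(model, model_m, momentum=0.995):
    """Update the momentum (teacher) model as an exponential moving average of the base model."""
    for p, p_m in zip(model.parameters(), model_m.parameters()):
        p_m.data.mul_(momentum).add_(p.data, alpha=1.0 - momentum)


def distill_loss(logits, logits_m, hard_loss, alpha):
    """Mix the one-hot loss with the soft pseudo-target loss, as in Eq. (6)-(7)."""
    q = F.softmax(logits_m, dim=-1).detach()                      # pseudo-targets from the momentum model
    # cross-entropy with the soft targets; equals KL(q || p) up to the constant entropy of q
    soft_loss = -(q * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```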
The figure above shows the top-5 candidates from the pseudo-targets, which effectively capture words and texts relevant to an image.
The authors also apply MoD to downstream tasks. The final loss of each task is a weighted combination of the original task loss and the KL divergence between the model's prediction and the pseudo-targets. For simplicity, the weight is set to $\alpha=0.4$ for all pre-training and downstream tasks.
4. Pre-training datasets
Following UNITER, two web datasets (Conceptual Captions, SBU Captions) and two in-domain datasets (COCO, Visual Genome) are used. The number of unique images is 4M, and the number of image-text pairs is 5.1M. To show the scalability of the method on large-scale web data, the authors also include the much noisier Conceptual 12M dataset, increasing the total number of images to 14.1M.
5. Implementation details
The model consists of a $\text{BERT}_{base}$ with 123.7M parameters and a ViT-B/16 with 85.8M parameters. It is pre-trained for 30 epochs on 8 NVIDIA A100 GPUs with a batch size of 512, using the AdamW optimizer with a weight decay of 0.02. The learning rate is warmed up to $1e^{-4}$ over the first 1000 iterations and then decayed to $1e^{-5}$ following a cosine schedule. During pre-training, random image crops of resolution $256\times 256$ are used as input and RandAugment is applied. During fine-tuning, the image resolution is increased to $384\times 384$ and the positional encodings of the image patches are interpolated. The momentum parameter for updating the momentum model is set to 0.995, the queue size for image-text contrastive learning is set to 65,536, and the distillation weight $\alpha$ is linearly ramped up from 0 to 0.4 within the first epoch.
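A sketch of this optimization schedule is shown below (illustrative helpers; `total_steps` must be supplied by the caller):

```python
import math
import torch


def build_optimizer(model):
    return torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.02)


def lr_at_step(step, total_steps, warmup_steps=1000, lr_max=1e-4, lr_min=1e-5):
    """Linear warm-up for the first 1000 steps, then cosine decay from 1e-4 to 1e-5."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def alpha_at_step(step, steps_per_epoch, alpha_max=0.4):
    """Distillation weight: linear ramp from 0 to 0.4 during the first epoch, constant afterwards."""
    return alpha_max * min(1.0, step / steps_per_epoch)
```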
III. A Mutual Information Maximization Perspective
This section provides an alternative perspective on ALBEF, showing that it maximizes a lower bound on the mutual information between different views of an image-text pair. ITC, MLM, and MoD can be interpreted as different ways of generating views.
Formally, define two random variables $a$ and $b$ as two different views of a data point. In self-supervised learning, $a$ and $b$ are two augmentations of the same image. In vision-language representation learning, $a$ and $b$ are different variants of an image-text pair that capture the same semantics. The goal is to learn representations that are invariant to the change of view, which can be achieved by maximizing the mutual information between $a$ and $b$. In practice, a lower bound on $\text{MI}(a,b)$ is maximized by minimizing the InfoNCE loss:
$$\mathcal{L}_{NCE}=-\mathbb{E}_{p(a,b)}\Bigg[\log\frac{\exp(s(a,b))}{\sum_{\hat{b}\in\hat{B}}\exp(s(a,\hat{b}))}\Bigg] \tag{8}$$
where $s(a,b)$ is a scoring function and $\hat{B}$ contains the positive sample $b$ and $|\hat{B}|-1$ negative samples.
The ITC loss of this paper can be rewritten as:
$$\mathcal{L}_{itc}=-\frac{1}{2}\mathbb{E}_{p(I,T)}\Big[\log\frac{\exp(s(I,T)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)}+\log\frac{\exp(s(T,I)/\tau)}{\sum_{m=1}^M\exp(s(T,I_m)/\tau)} \Big] \tag{9}$$
Minimizing $\mathcal{L}_{itc}$ can therefore be seen as maximizing a symmetric version of InfoNCE. ITC treats the two modalities of an image-text pair as its two views, and trains the unimodal encoders to maximize the MI between the image view and the text view.
MLM can also be interpreted as maximizing the mutual information between a masked word and its context. Specifically, the MLM loss can be rewritten as:
$$\mathcal{L}_{mlm}=-\mathbb{E}_{p(I,\hat{T})}\Big[\log\frac{\exp(\psi(y^{msk})^\top f(I,\hat{T}))}{\sum_{y\in\mathcal{V}}\exp(\psi(y)^\top f(I,\hat{T}))}\Big] \tag{10}$$
where $\psi(y):\mathcal{V}\rightarrow \mathbb{R}^d$ is a lookup function in the output layer of the multimodal encoder that maps a word token $y$ to a vector, $\mathcal{V}$ is the full vocabulary, and $f(I,\hat{T})$ returns the final hidden state of the multimodal encoder at the masked position. Therefore, MLM treats an image-text pair as two views: (1) a randomly selected word token; (2) the image plus the masked context of that word.
Both ITC and MLM generate views by taking partial information from an image-text pair. The momentum distillation in this paper can instead be viewed as generating alternative views from the entire proposal distribution. Taking the $\text{ITC}_{MoD}$ of Equation (6) as an example, minimizing $\text{KL}(\textbf{q}^{i2t}(I)\parallel\textbf{p}^{i2t}(I))$ is equivalent to minimizing the following objective:
$$-\sum_{m} q_m^{i2t}(I)\log p_m^{i2t}(I)=-\sum_m \frac{\exp(s'(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s'(I,T_m)/\tau)}\log \frac{\exp(s(I,T_m)/\tau)}{\sum_{m=1}^M\exp(s(I,T_m)/\tau)} \tag{11}$$
This maximizes the mutual information $\text{MI}(I,T_m)$ for texts $T_m$ that share similar semantics with the image $I$, because such texts receive a larger $q_m^{i2t}(I)$. Likewise, $\text{ITC}_{MoD}$ also maximizes $\text{MI}(I_m,T)$ for images similar to the text $T$. In the same way, $\text{MLM}_{MoD}$ generates alternative views $y'\in\mathcal{V}$ for the masked word $y^{msk}$ and maximizes the MI between $y'$ and $(I,\hat{T})$. Therefore, momentum distillation can be seen as performing data augmentation on the original views: the momentum model generates views that differ from the original image-text pair and encourages the base model to learn representations that capture view-invariant semantic information.
IV. Downstream V+L Tasks

The pre-trained model is adapted to five downstream V+L tasks. Each task and its fine-tuning strategy are described below.
1. Image-Text Retrieval
Image-text retrieval contains two subtasks: image-to-text retrieval (TR) and text-to-image retrieval (IR). ALBEF is evaluated on the Flickr30K and COCO benchmarks, and the pre-trained model is fine-tuned with the training samples of each dataset. For zero-shot retrieval on Flickr30K, the model fine-tuned on COCO is evaluated. During fine-tuning, the ITC loss and the ITM loss are optimized jointly: ITC learns an image-text scoring function based on unimodal similarity, while ITM models fine-grained interactions between images and texts to predict matching scores. Because each image in the downstream datasets is paired with multiple texts, ITC is modified to consider multiple positive samples in the queue, with each positive receiving a ground-truth probability of 1/#positives. During inference, the unimodal feature similarity score $s_{itc}$ is first computed for all image-text pairs; the top-k candidates are then selected and re-ranked by their ITM scores $s_{itm}$. Because $k$ can be set very small, inference is much faster.
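The two-stage inference can be sketched as follows; `itm_score` is a hypothetical callable that runs the multimodal encoder and returns the matching logit for one image-text pair:

```python
import torch


@torch.no_grad()
def retrieve_texts(img_feat, txt_feats, image, texts, itm_score, k=128):
    """Rank texts for one image: coarse ranking by s_itc, then re-rank the top-k by s_itm."""
    sims = (img_feat @ txt_feats.t()).squeeze(0)    # (num_texts,) unimodal s_itc scores
    topk_sim, topk_idx = sims.topk(k)
    # Only the k candidates are passed through the expensive multimodal encoder
    itm_scores = torch.stack([itm_score(image, texts[i]) for i in topk_idx])
    order = itm_scores.argsort(descending=True)
    return topk_idx[order]                          # candidate text indices ranked by s_itm
```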
2. Visual Entailment
Visual entailment is a fine-grained visual reasoning task that predicts whether the relationship between an image and a text is entailment, neutral, or contradiction. Following UNITER, it is treated as a three-way classification problem, and an MLP on the [CLS] representation of the multimodal encoder predicts the class probabilities.
3. Visual Question Answering (VQA)
Given an image and a question, VQA requires the model to predict an answer. Unlike existing methods that treat VQA as a multi-answer classification problem, the authors treat VQA as an answer generation problem. Specifically, a 6-layer Transformer decoder is used to generate the answer. As shown in figure (a) above, the autoregressive answer decoder attends to the multimodal embeddings, with a [CLS] token as its initial input, and a [SEP] token appended to the output indicates completion of generation. The answer decoder is initialized with the pre-trained weights of the multimodal encoder and fine-tuned with a conditional language-modeling loss. For a fair comparison with existing methods, the decoder is constrained during inference to generate only from the 3,192 candidate answers.
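One simple way to realize this constraint is to score every candidate answer with the decoder and pick the most likely one, as in the sketch below; `decoder_logprob` is a hypothetical callable returning the summed token log-probability of an answer given the fused question/image representation:

```python
import torch


@torch.no_grad()
def rank_candidate_answers(fusion_states, candidate_token_ids, decoder_logprob):
    """Score each of the 3,192 candidate answers and return the index of the most likely one."""
    scores = []
    for ans_ids in candidate_token_ids:                          # one token-id tensor per candidate answer
        scores.append(decoder_logprob(fusion_states, ans_ids))   # summed token log-probabilities
    return torch.stack(scores).argmax()
```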
4. Natural Language for Visual Reasoning (NLVR2)
NLVR2 requires the model to determine whether a text describes a pair of images. The multimodal encoder is extended to enable reasoning over two images. As shown in figure (b) above, each layer of the multimodal encoder is duplicated into two consecutive Transformer blocks, each containing a self-attention layer, a cross-attention layer, and a feed-forward layer. The two blocks in each layer are initialized with the same pre-trained weights, and the two cross-attention layers share the same linear projection weights. During training, the two blocks receive the embedding sets of the two images respectively. An MLP classifier on the multimodal encoder's [CLS] representation makes the prediction.
For NLVR2, an additional pre-training step prepares the new multimodal encoder for encoding image pairs. The authors design a text-assignment (TA) task: given a pair of images and a text, the model must assign the text to the first image, the second image, or neither. This is formulated as a three-way classification problem, and an FC layer on the [CLS] representation predicts the assignment. The model is pre-trained with TA for 1 epoch on the 4M images.
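The per-layer duplication can be sketched as follows (illustrative; `copy.deepcopy` simply copies the pre-trained weights, and the additional sharing of cross-attention projections described above is not shown):

```python
import copy
import torch.nn as nn


def duplicate_fusion_layers(fusion_layers):
    """Duplicate each multimodal layer into two blocks with identical initial weights."""
    paired = nn.ModuleList()
    for layer in fusion_layers:
        block_a = layer                    # receives the embeddings of the first image
        block_b = copy.deepcopy(layer)     # receives the embeddings of the second image
        paired.append(nn.ModuleList([block_a, block_b]))
    return paired
```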
5. Visual Grounding
The goal of visual grounding is to localize the region in an image that corresponds to a specific text description. The authors study the weakly-supervised setting, where no bounding box annotations are available. Experiments are conducted on the RefCOCO+ dataset, and the model is fine-tuned with only image-text supervision, using the same strategy as image-text retrieval. During inference, Grad-CAM is extended to obtain heat maps, which are used to rank the detected proposals.
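A simplified Grad-CAM-style sketch is shown below: the cross-attention map of a chosen multimodal layer is weighted by its (positive) gradient with respect to the ITM matching score, yielding a per-patch relevance map. How `attn_map` is captured (e.g., via a forward hook) is left out, and the aggregation over heads and text tokens is an illustrative choice.

```python
import torch


def gradcam_heatmap(attn_map, itm_score):
    """Weight a cross-attention map by its positive gradient w.r.t. the ITM score (Grad-CAM style)."""
    # attn_map: (num_heads, num_text_tokens, num_image_patches), retained in the graph of itm_score
    grad = torch.autograd.grad(itm_score, attn_map, retain_graph=True)[0]
    cam = (attn_map * grad.clamp(min=0)).mean(dim=0)   # average over attention heads
    return cam.sum(dim=0)                              # aggregate over text tokens -> per-patch relevance
```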
V. Experiments
(Omitted.)