Zero-Shot Image Retrieval (Zero-Shot Cross-Modal Retrieval)
2022-07-25 12:02:00 【Shangshanxianger】
The previous post gave a brief overview of meta-learning and few-shot learning; this one surveys several papers that use zero-shot learning for retrieval. The difficulty of the task is that human sketches are used as queries to retrieve photos from unseen categories:
- Sketches and photos differ greatly across modalities: a sketch contains only the object's outline and carries much less information than an image.
- Because different people draw in different styles, the intra-class variance of sketches is also large.
- The method has to adapt to large-scale retrieval and retrieve images from unseen categories.

A Zero-Shot Framework for Sketch Based Image Retrieval
From ECCV 2018. The main idea is to tackle the problem with generative models: through generation, sketch information can be injected so that the model learns to associate the sketch's outline, local shapes, and other features with the corresponding image features. The model is shown in the figure above; the left and right halves are the authors' two architectures, CVAE and CAAE, i.e., both mainstream families of generative models (VAE and GAN) are tested.
- CVAE uses a conditional variational autoencoder, i.e., a given feature takes part in the VAE reconstruction as a condition, which directly yields the loss $L = -D_{KL}\big(q(z \mid x_{img}, x_{sketch}) \,\|\, p(z \mid x_{sketch})\big) + E[\log p(x_{img} \mid z, x_{sketch})]$. To preserve the latent alignment with the sketch, a reconstruction loss (the regularization loss in the figure) is added: $L_{rec} = \lambda \| f_{NN}(x'_{img}) - x_{sketch} \|_2^2$ (a minimal code sketch of this objective appears below).
- CAAE uses an adversarial autoencoder. Similarly, following the adversarial idea of GANs, the feature generator acts as the generator $G$ that minimizes $E_z[\log p(x_{img} \mid z, x_{sketch})] + E_{img}[\log(1 - D(E(x_{img})))]$, while the discriminator $D$ maximizes $E_z[\log D(z)] + E_{img}[\log(1 - D(E(x_{img})))]$. The same reconstruction loss $L_{rec} = \lambda \| f_{NN}(x'_{img}) - x_{sketch} \|_2^2$ is added as well.
The authors' experiments show that CVAE works better than CAAE, probably because the adversarial training of CAAE is unstable.
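As a minimal sketch of the CVAE objective above (KL term, conditional reconstruction, and the sketch-regression regularizer), assuming PyTorch; the argument names and the MSE stand-in for the log-likelihood term are my own simplifications, not the paper's code:

```python
import torch
import torch.nn.functional as F

def cvae_loss(mu, logvar, mu_prior, logvar_prior,
              x_img, x_img_recon, f_nn_out, x_sketch, lam=1.0):
    """Sketch of the CVAE objective (as a loss to minimize):
    KL(q(z|img,sketch) || p(z|sketch)) + image reconstruction
    + lambda * ||f_NN(x'_img) - x_sketch||^2."""
    # KL divergence between two diagonal Gaussians: encoder posterior vs. sketch-conditioned prior
    kl = 0.5 * torch.sum(
        logvar_prior - logvar
        + (logvar.exp() + (mu - mu_prior) ** 2) / logvar_prior.exp()
        - 1.0, dim=1).mean()
    # Reconstruction of image features from (z, sketch); MSE stands in for -log p(x_img | z, x_sketch)
    recon = F.mse_loss(x_img_recon, x_img)
    # Regularizer: a small network f_NN should map generated image features back to the sketch features
    reg = lam * F.mse_loss(f_nn_out, x_sketch)
    return kl + recon + reg
```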

Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-based Image Retrieval
From CVPR 2019. Its rather complex structure is shown in the figure above: four generators and three discriminators in total.
- The four generators each have their own role and learn semantic mappings in different directions, so the model can both learn semantics within each modality and perform cross-modal semantic alignment: $G_{sk}: X \to S$, $G_{im}: Y \to S$, $F_{sk}: S \to X$, $F_{im}: S \to Y$.
- Correspondingly, two of the discriminators distinguish features within their own modality, and the third distinguishes features across modalities.
What is more interesting is the cycle consistency loss, which this blogger has already covered in the post on cross-modal retrieval; it is a classic trick for cross-modal problems. It lets features not only be mapped into the corresponding semantic space but also be mapped from the semantic space back to the original feature space, which strengthens feature learning: $L_{cyc} = E[\|F_{sk}(G_{sk}(x)) - x\|_1] + E[\|G_{sk}(F_{sk}(s)) - s\|_1]$
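A minimal sketch of this cycle-consistency term, assuming PyTorch and treating $G_{sk}$ and $F_{sk}$ as arbitrary callable modules (only the mapping names come from the equation; everything else is my own illustration):

```python
import torch

def cycle_consistency_loss(G_sk, F_sk, x_sketch, s_semantic):
    """L_cyc = E[||F_sk(G_sk(x)) - x||_1] + E[||G_sk(F_sk(s)) - s||_1]:
    sketch features mapped to the semantic space and back should return
    to themselves, and likewise for semantic embeddings."""
    forward_cycle = (F_sk(G_sk(x_sketch)) - x_sketch).abs().mean()
    backward_cycle = (G_sk(F_sk(s_semantic)) - s_semantic).abs().mean()
    return forward_cycle + backward_cycle
```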

Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval
From CVPR 2019. It also looks for a mapping between modalities, but introduces a gradient reversal layer (GRL) to guide the embedding. There are three losses:
- Triplet loss: construct positive and negative pairs so that pairs from the same class score higher than pairs from different classes.
- Domain loss: use the GRL to project the features of the two modalities into the same space, yielding a domain-agnostic embedding (a minimal GRL sketch follows after this list).
- Semantic loss: introduce word embeddings to strengthen the connection between the two modalities, i.e., force the embedding to carry semantic information by reconstructing the word embedding.
The final loss is a weighted combination of the three: $L = \alpha_1 L_t + \alpha_2 L_d + \alpha_3 L_s$
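Since the GRL is what makes the domain loss produce a domain-agnostic embedding, here is a minimal PyTorch sketch of a standard gradient reversal layer (a generic implementation, not the paper's code; the `domain_classifier` in the usage comment is hypothetical):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass, multiplies
    the gradient by -lambda on the backward pass, so the feature extractor
    is trained to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: features pass through the GRL before a (hypothetical) domain
# classifier; its cross-entropy loss then pushes the features toward domain invariance.
# domain_logits = domain_classifier(grad_reverse(features, lam=0.5))
```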

Zero-Shot Sketch-Based Image Retrieval via Graph Convolution Network
From AAAI 2020. The authors argue that the generative models above cannot make effective use of side information when generating plausible image features, and that they are unstable to train, so they propose a GCN-based model to alleviate these shortcomings. The model diagram is shown above; SketchGCN contains three sub-networks: an encoding network, a semantic-preserving network, and a semantic reconstruction network.
- The encoding network tries to embed sketches and images into a common semantic space.
- The semantic-preserving network takes the features as input and uses side information to force them to keep category-level relationships. The main point here is to learn the relationships between categories (after all, the key to the task is transferring from seen to unseen, so category knowledge matters a lot) so that knowledge can be transferred. The features are therefore used directly to build a graph, on which a GCN is run: $H^{(l+1)} = \sigma(A' H^{(l)} W^{(l)})$. The graph is actually constructed by computing similarities between semantic features, $a_{i,j} = e^{-\frac{\|s_i - s_j\|_2^2}{t}}$ (see the sketch after this list).
- The semantic reconstruction network further forces the extracted features to retain their semantic relationships. As in the previous models, a CVAE with reconstruction and semantic losses is used to constrain the learned space.
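A minimal sketch of the semantic-preserving step: build the adjacency from pairwise semantic distances with a Gaussian kernel, then apply the propagation rule $H^{(l+1)} = \sigma(A' H^{(l)} W^{(l)})$. This assumes PyTorch, and the symmetric normalization follows the standard GCN recipe rather than the paper's exact construction:

```python
import torch
import torch.nn as nn

def build_adjacency(s, t=1.0):
    """a_{i,j} = exp(-||s_i - s_j||^2 / t) from semantic embeddings s,
    then symmetrically normalized (A' = D^-1/2 A D^-1/2) as in a standard GCN."""
    dist2 = torch.cdist(s, s) ** 2
    a = torch.exp(-dist2 / t)
    d_inv_sqrt = a.sum(dim=1).rsqrt()
    return d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]

class GCNLayer(nn.Module):
    """One propagation step H^{(l+1)} = sigma(A' H^{(l)} W^{(l)})."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        return torch.relu(adj @ self.w(h))

# Usage sketch (dimensions hypothetical):
# s = torch.randn(32, 300)                 # semantic embeddings of 32 nodes
# adj = build_adjacency(s, t=1.0)
# h_next = GCNLayer(512, 256)(torch.randn(32, 512), adj)
```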

Learning Cross-Aligned Latent Embedding for Zero-Shot Cross-Modal Retrieval
From AAAI 2020. This work does cross-modal retrieval with text. Instead of directly using class embeddings as the semantic space, it trains a multi-modal variational autoencoder (VAE) to learn latent embeddings, with the class acting as a bridge, and then aligns the modalities by matching their parametric distributions. The model is shown above: a VAE is first learned for each of the three modalities, then image and text go through a cycle-consistent conversion and reconstruct each other.
What is more meaningful is that two constraints are imposed when aligning the cross-modal space:
- Take the class embedding as a bridge and align the multivariate Gaussian distributions of the latent embeddings pairwise. Concretely, a 2-Wasserstein distance is computed (a closed-form sketch follows after this list).
- Because the association between the image and text modalities is built only implicitly through the class embedding, a second scheme is considered to explicitly enhance the semantic relevance of the two modalities. Specifically, maximum mean discrepancy (MMD) is used.
The final loss is again the sum of the losses above.
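For the first constraint, the 2-Wasserstein distance between two Gaussians with diagonal covariance (the usual VAE setting) has a simple closed form; a minimal sketch, with tensor shapes and names of my own choosing:

```python
import torch

def wasserstein2_diag_gaussians(mu1, logvar1, mu2, logvar2):
    """Closed-form squared 2-Wasserstein distance between N(mu1, diag(var1))
    and N(mu2, diag(var2)): ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2.
    Inputs are (batch, dim) tensors; returns one distance per row."""
    sigma1 = (0.5 * logvar1).exp()
    sigma2 = (0.5 * logvar2).exp()
    return ((mu1 - mu2) ** 2).sum(dim=1) + ((sigma1 - sigma2) ** 2).sum(dim=1)
```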

Correlated Features Synthesis and Alignment for Zero-Shot Cross-Modal Retrieval
From SIGIR 2020, by the same authors as the previous paper, so the approach is similar, except that the VAE is replaced with a GAN and the work is done both within and across modalities.
The model architecture diagram is shown above. A WGAN is first set up between class and image, and another between class and text (so, as in the previous paper, the class acts as a bridge); each discriminator computes its own loss, then cycle-consistency and distribution-alignment losses are computed in the unified semantic space, and the final loss is the sum of all of them.
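A minimal sketch of the WGAN objective used for each class-modality pair (the standard WGAN critic/generator losses, not the paper's code; gradient penalty or weight clipping is omitted for brevity):

```python
import torch

def wgan_critic_loss(critic, real_feats, fake_feats):
    """Critic maximizes E[D(real)] - E[D(fake)]; the negation is returned
    so it can be minimized with a standard optimizer."""
    return -(critic(real_feats).mean() - critic(fake_feats).mean())

def wgan_generator_loss(critic, fake_feats):
    """Generator maximizes E[D(fake)], i.e., minimizes -E[D(fake)]."""
    return -critic(fake_feats).mean()
```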