
[Few-shot segmentation] MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation

2022-07-27 04:18:00 【Chestnut vegetable】

Paper link: MSANet
Code link: MSANet-code

Abstract

Few-shot segmentation (FSS) aims to segment objects of unseen classes given only a few densely annotated samples. Prototype learning, in which features extracted from the support images are averaged over global and local object information to produce one or more prototypes, has been widely used in FSS. However, prototype vectors alone may not be enough to represent the features of all support images. To extract richer features and make more accurate predictions, we propose a multi-similarity and attention network (MSANet) with two new modules: a multi-similarity module and an attention module. The multi-similarity module uses multiple feature maps of the support and query images to estimate accurate semantic relationships. The attention module guides MSANet to focus on class-relevant information.

Introduction

With the development of mature large-scale datasets [9,10,13,26], a series of supervised convolutional neural networks (CNNs) have shown great potential in semantic segmentation [1,34,40,41,49]. The performance of these supervised networks depends heavily on the quality and quantity of the training data, for example the amount of well-annotated data, the balance of the class distribution, and how representative the samples are. In practice, however, large amounts of annotated data are hard to obtain, especially for dense prediction tasks [2,3,14,21,57,59]. In addition, conventional supervised networks may struggle to generalize to images containing unseen classes. Inspired by the human ability to recognize objects from only a small amount of input, few-shot learning (FSL) techniques [8,42,53,56] were developed. FSL builds networks that can generalize to new classes when only a few annotated samples are available. Few-shot segmentation (FSS) [27-29,31,32,35,36,45,46,48,51,54,58,60,61,63,64] is one application of few-shot learning that focuses on semantic segmentation. The goal of FSS is to segment the region of the selected class in a query image using the support images and their corresponding annotation masks. The most popular FSS approach is metric-based prototype learning [51]. As shown in the top half of Figure 1, masked average pooling (MAP) generates one or more class-representative prototype vectors [67]. A feature-processing network then uses these prototype vectors to segment the target object in the query image. Many researchers have tried to draw more guidance from the prototype vectors with different mechanisms, for example PANet [55], PFENet [51], SG-One [67], CANet [65], and ASGNet [22]. However, because of the masked average pooling operation, such prototype networks may lose detailed spatial information of the image.
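To make the masked average pooling (MAP) step concrete, here is a minimal sketch in plain Python. It assumes the support feature map has already been resized to the same height and width as the binary mask. The function name and the nested-list data layout are illustrative choices, not taken from the MSANet code.

```python
def masked_average_pooling(features, mask):
    """Average each feature channel over the foreground pixels only.

    features: list of C channels, each an H x W nested list of floats.
    mask:     H x W nested list of 0/1 foreground labels (assumed same size).
    Returns a length-C prototype vector.
    """
    fg_count = sum(sum(row) for row in mask)
    if fg_count == 0:
        raise ValueError("mask has no foreground pixels")
    prototype = []
    for channel in features:
        # Background pixels are multiplied by 0, so they do not contribute.
        total = sum(
            value * m
            for feat_row, mask_row in zip(channel, mask)
            for value, m in zip(feat_row, mask_row)
        )
        prototype.append(total / fg_count)
    return prototype


# Toy support feature map: 2 channels, 2 x 2 pixels.
support_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 10.0], [20.0, 30.0]],
]
support_mask = [[1, 0], [0, 1]]  # foreground: top-left and bottom-right
prototype = masked_average_pooling(support_features, support_mask)
print(prototype)  # [2.5, 15.0]
```

The whole foreground region collapses into a single vector with one value per channel, which is exactly why the spatial layout of the object is lost.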
To address this, we propose a multi-similarity and attention network (MSANet) built around two guidance modules. As shown in the bottom half of Figure 1, the network includes a multi-layer similarity module and an attention module. These two modules are expected to support the prototype learning paradigm and guide MSANet toward finer segmentation. Recent research shows that the visual correspondence between the support image and the query image [12] can be used to improve FSS networks. To establish more meaningful correspondences, dense intermediate-layer features [33,37,38] and correlation tensor learning [24,43,52] have been used. Juhong Min et al. designed HSNet [36], a hypercorrelation squeeze network built on 4D tensors and multi-layer dense feature correlations. Similarly, we propose a multi-similarity module that extracts multi-layer feature correlations from the backbone network and applies a simple convolution block to them. We also propose a lightweight CNN attention block that pays more attention to the target-class content of the image. Following the architecture of BAM [20], we use a base learner and an ensemble module to refine the segmentation results. Our main contributions to the FSS challenge are summarized below:
A multi-layer similarity module is proposed to obtain an informative visual correspondence between the support image and the query image.
A simple but effective attention module is proposed that uses the support image and its corresponding mask to better capture class-relevant information.
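The idea of comparing support and query features at several backbone layers can be sketched as follows. This is only a rough illustration under strong assumptions: every layer is flattened into a list of per-position feature vectors, all layers share the same number of positions and the same flattened mask, and each query position is scored by its mean cosine similarity to the foreground support positions. The actual MSANet module may aggregate differently and feeds the result to a convolution block, which is omitted here. All function names are hypothetical.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def layer_similarity(query_vectors, support_vectors, support_mask):
    """One similarity value per query position for a single layer."""
    foreground = [v for v, m in zip(support_vectors, support_mask) if m]
    if not foreground:
        raise ValueError("support mask has no foreground positions")
    return [
        sum(cosine(q, s) for s in foreground) / len(foreground)
        for q in query_vectors
    ]


def multi_layer_similarity(query_layers, support_layers, support_mask):
    """Stack one similarity map per backbone layer (assumed pre-aligned)."""
    return [
        layer_similarity(q, s, support_mask)
        for q, s in zip(query_layers, support_layers)
    ]


# Two toy layers, two spatial positions each, 2-dimensional features.
query_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [1.0, -1.0]],
]
support_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [5.0, 0.0]],
]
support_mask = [1, 0]  # only the first support position is foreground
similarity_maps = multi_layer_similarity(query_layers, support_layers, support_mask)
```

In this toy input the first query position matches the foreground support feature in both layers and the second is orthogonal to it, so each layer yields a map close to [1, 0]. Unlike a single pooled prototype, the stacked maps keep one value per spatial position.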

Related work

Semantic segmentation

Semantic segmentation is a computer vision task that assigns each pixel of a given image to one of a specified set of categories. Thanks to progress in fully convolutional networks (FCNs), many model structures have been proposed to improve segmentation performance, such as the encoder-decoder based UNet, PSPNet based on the pyramid pooling module (PPM), and DeepLab based on atrous spatial pyramid pooling (ASPP). A series of further techniques has also been proposed, including dilated convolution, multi-level feature aggregation, and attention mechanisms. However, conventional segmentation models need sufficient annotated data and have difficulty predicting unseen categories without fine-tuning, which hinders practical application to some extent.

Few-shot learning

To solve these problems, FSL was introduced; its purpose is to recognize unseen categories from a small number of annotated samples. FSL methods can be subdivided into three branches: (i) optimization-based [11,19,42], (ii) augmentation-based [6,7], and (iii) metric-based [23,48,50]. Optimization-based methods propose gradient update strategies that overcome data bias and improve the generalization ability of the model. Augmentation-based methods address the shortage of data by generating synthetic training images. Our work is closely related to metric-based methods, which aim to learn a general metric function for computing the distance between query and support images. These metric-based methods have made notable progress. Among them, the matching network uses a special mini-batch called an episode to match the training and test environments. The relation network converts query and support images into 1x1 vectors and then classifies them based on cosine similarity (CS). A prototypical network has also been proposed, which directly uses the feature representation (the prototype) computed by a global average pooling operation.
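The metric-based idea described above can be illustrated with a small nearest-prototype classifier. This is a sketch under assumptions: the embeddings are hand-written toy vectors rather than CNN features, each class prototype is the plain mean of its support embeddings, and cosine similarity is used as the metric. The names `mean_vector`, `build_prototypes`, and `classify` are made up for this example.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine_similarity(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def build_prototypes(support_set):
    """Map each class name to the mean of its support embeddings."""
    return {label: mean_vector(vectors) for label, vectors in support_set.items()}


def classify(query, prototypes):
    """Return the class whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine_similarity(query, prototypes[label]))


# A 2-way 2-shot episode with toy 2-dimensional embeddings.
support_set = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support_set)
predicted = classify([0.9, 0.1], prototypes)
print(predicted)  # cat
```

No weights are updated for the new classes; only the prototypes change from episode to episode, which is what lets the same metric be reused on unseen categories.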

Few-shot semantic segmentation

Shaban et al. proposed OSLSM, one of the pioneering works in FSS, which generates classifier weights for query image segmentation. Its first branch takes the support image as input and generates a parameter vector; its second branch combines these parameters with the query image and outputs a segmentation mask. Later, the prototype learning paradigm was introduced to better extract information from the support and query images. SG-One introduces the masked average pooling operation to compute class-representative prototype vectors and generates a spatial similarity map. CANet proposes a two-branch dense comparison network with an iterative refinement module. PFENet computes the cosine similarity (CS) of high-level features without trainable parameters to create a prior mask, and also introduces a feature enrichment module. Instead of prototype expansion, ASGNet proposes a superpixel-guided clustering method that extracts multiple prototypes from the support images and uses an allocation strategy to reconstruct the support feature map. However, most prototype learning methods lose spatial structure. There is still room to improve on class-representative prototype vectors in order to fully exploit the features of foreground objects. On the other hand, visual correspondence and correlation tensor processing have shown significant results in FSS [36-38]. HSNet is trained to squeeze dense feature correlation tensors and transforms them into a segmentation mask through high-dimensional convolution. However, high-dimensional (4D) convolution has high space and time complexity. To extract lightweight CNN features, DENet introduces a guided attention module that estimates the weights of the new classifier, an idea inspired by traditional methods.
Reference [17] proposes an attention-based multi-context guidance network that integrates small- to large-scale context information and guides the query branch globally. BAM does not focus on feature extraction or visual correspondence; instead it introduces a new approach to FSS that adds a supervised model branch trained on the base classes. This supervised model predicts the base classes in the query image, helping the meta learner suppress false predictions. Inspired by recent progress in visual correspondence and attention mechanisms, we propose a multi-layer similarity module and a lightweight attention module within the prototype network framework to take FSS networks to the next level.
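The training-free prior mask attributed to PFENet in this section can be sketched roughly as below. The article only states that it is built from the cosine similarity of high-level features without trainable parameters. Taking the maximum over foreground support positions and then min-max normalizing is an assumption of this sketch, as are the function names and the flattened data layout.

```python
import math


def cosine(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def prior_mask(query_vectors, support_vectors, support_mask, eps=1e-8):
    """Score each query position by its best match among foreground
    support positions, then rescale the scores to the range [0, 1]."""
    foreground = [v for v, m in zip(support_vectors, support_mask) if m]
    if not foreground:
        raise ValueError("support mask has no foreground positions")
    raw = [max(cosine(q, s) for s in foreground) for q in query_vectors]
    low, high = min(raw), max(raw)
    # Min-max normalization (assumed); eps avoids division by zero.
    return [(r - low) / (high - low + eps) for r in raw]


# Three query positions and two support positions with 2-dimensional features.
query_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
support_mask = [1, 0]  # only the first support position is foreground
prior = prior_mask(query_vectors, support_vectors, support_mask)
```

Here the first query position gets a value close to 1, the second gets 0, and the third falls in between. Because nothing in this computation is learned, the prior can be applied to unseen classes without retraining.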