[Semantic Segmentation] SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

2022-07-29 06:04:00 【Dull cat】

One、 Main idea

Most current semantic segmentation methods are FCN-based encoder-decoder architectures, but this family is weak at capturing long-range information:

To enlarge the receptive field, methods such as PSP, ASPP, and attention modules have been proposed.
These methods mainly operate on feature maps obtained after downsampling the original image. In other words, they enlarge the receptive field using high-level information and make little use of low-level information.

The method in this paper:

A semantic segmentation method whose encoder is a pure transformer.
Input: the original image is split into fixed-size patches to form a sequence of image patches. A linear embedding layer then turns them into a sequence of feature embedding vectors, which is the input to the transformer.

Two、 Implementation

1、 Turn the image into a sequence of patches:

Take $x\in R^{H\times W \times 3}$ and split it into a uniform $\frac{H}{16} \times \frac{W}{16}$ grid of patches (each patch is $16 \times 16$ pixels), then flatten the patches into a sequence of length $\frac{HW}{256}$.

2、 Linear projection: $f: p \to e \in R^C$

A linear projection $f$ maps each flattened patch $p$ to a $C$-dimensional embedding space, so the 2D image becomes a 1D sequence of embeddings.

3、 Positional encoding:

To encode the spatial information of each patch, the authors learn a specific positional embedding $p_i$ for every position $i$ and add it to $e_i$, giving the final input $E = \{e_1+p_1, e_2+p_2, \ldots, e_L+p_L\}$, where $L$ is the sequence length.
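Steps 1 to 3 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the toy sizes `H`, `W`, `C`, the random `weight` and `pos` values (stand-ins for learned parameters), and the helper names `patchify`, `linear`, and `embed` are all assumptions made for the example.

```python
import random

PATCH = 16  # patch side length, giving the H/16 x W/16 grid from the text

def patchify(image, patch=PATCH):
    """Split an H x W x 3 nested-list image into flattened patch vectors."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):          # row-major patch order
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)                  # length patch * patch * 3
    return patches

def linear(vec, weight):
    """Project a vector with a (C x D) weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def embed(image, weight, pos):
    """Return E = [e_1 + p_1, ..., e_L + p_L]."""
    tokens = [linear(p, weight) for p in patchify(image)]
    return [[e + q for e, q in zip(tok, p_i)] for tok, p_i in zip(tokens, pos)]

rng = random.Random(0)
H, W, C = 32, 48, 8                   # toy sizes, not the paper's settings
D = PATCH * PATCH * 3                 # flattened patch length
L = (H // PATCH) * (W // PATCH)       # sequence length HW / 256
image = [[[rng.random() for _ in range(3)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(C)]  # stand-in for f
pos = [[rng.gauss(0, 0.02) for _ in range(C)] for _ in range(L)]     # stand-in for p_i
E = embed(image, weight, pos)
print(len(E), len(E[0]))              # sequence length and embedding size
```

In a real model the projection and the positional embeddings are trained jointly with the rest of the network; here they are random only so that the shapes can be checked.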

4、Transformer:

The transformer takes the sequence $E$ above as input, so every layer has a global receptive field. This addresses the limited receptive field of FCN-style methods.
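The global receptive field comes from self-attention: every output token is a weighted sum over all input tokens. The sketch below is a deliberately simplified single-head attention that uses the tokens themselves as queries, keys, and values (an assumption for brevity; a real transformer layer adds learned projections, multiple heads, layer normalization, and an MLP).

```python
import math

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (simplification)."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)                          # subtract max for stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]             # softmax over all positions
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, tokens))
                        for j in range(dim)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # toy sequence
out, attn = self_attention(tokens)
# Every row of attn has a non-zero weight for every position,
# so each output token depends on the whole sequence.
```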

5、Decoder

The decoder's role: produce a 2D segmentation result with the same size as the original image.

Therefore, the encoder feature $Z$ must first be reshaped from $\frac{HW}{256} \times C$ to $\frac{H}{16} \times \frac{W}{16} \times C$.
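The reshape can be sketched as below. The helper name `sequence_to_grid` is hypothetical, and the sketch assumes the tokens are stored in row-major patch order (left to right, then top to bottom).

```python
def sequence_to_grid(tokens, height, width, patch=16):
    """Reshape a length-(HW/256) token list into an (H/16) x (W/16) grid.

    Assumes row-major patch order; each token is a length-C list.
    """
    rows, cols = height // patch, width // patch
    if len(tokens) != rows * cols:
        raise ValueError("token count does not match the patch grid")
    return [tokens[r * cols:(r + 1) * cols] for r in range(rows)]

C = 4
tokens = [[float(i)] * C for i in range(6)]    # 6 tokens for a 32 x 48 image
grid = sequence_to_grid(tokens, height=32, width=48)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 2 3 4
```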

Method 1: Naive upsampling (Naive)

① Project the transformer feature $Z^{L_e}$ to the number of segmentation classes (e.g., 19 for Cityscapes), using:

1x1 conv + sync batch norm (with ReLU) + 1x1 conv

② Upsample to the full image resolution with bilinear interpolation, then compute the loss
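A toy version of this decoder is sketched below, under stated assumptions: batch normalization is omitted, the weights are hand-picked rather than learned, and `bilinear_resize` uses a corner-aligned sampling convention, which is only one of several conventions used by deep learning frameworks. All function names are hypothetical.

```python
def conv1x1(feature, weight):
    """1x1 convolution on an h x w x c_in nested list (a per-pixel linear map)."""
    return [[[sum(w * v for w, v in zip(row, px)) for row in weight]
             for px in line] for line in feature]

def relu(feature):
    return [[[max(0.0, v) for v in px] for px in line] for line in feature]

def bilinear_resize(plane, out_h, out_w):
    """Bilinear resize of a 2D list, aligning the corner samples."""
    in_h, in_w = len(plane), len(plane[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
            bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def naive_decoder(feature, w_hidden, w_cls, out_h, out_w):
    hidden = relu(conv1x1(feature, w_hidden))   # batch norm omitted in this sketch
    logits = conv1x1(hidden, w_cls)             # h x w x num_classes
    n = len(w_cls)
    planes = [[[px[k] for px in line] for line in logits] for k in range(n)]
    up = [bilinear_resize(p, out_h, out_w) for p in planes]
    return [[[up[k][i][j] for k in range(n)] for j in range(out_w)]
            for i in range(out_h)]

feature = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]  # toy 2 x 2 x 2 feature
w_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_cls = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]                  # 3 toy classes
logits = naive_decoder(feature, w_hidden, w_cls, out_h=32, out_w=32)
labels = [[max(range(3), key=px.__getitem__) for px in line] for line in logits]
```

The per-pixel `labels` map is what a cross-entropy loss would be compared against during training; the loss itself is left out of the sketch.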

Method 2: Progressive UPsampling (PUP)

Progressive upsampling alternates convolution layers and upsampling operations. To avoid the errors introduced by upsampling by a large factor in a single step, each step upsamples by only 2×. Bringing $Z^{L_e}$ from $\frac{H}{16} \times \frac{W}{16}$ up to the original image size therefore takes 4 operations.
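The count of 4 follows from $16 = 2^4$. A small sketch that traces the spatial size after each 2× step (the function name `pup_schedule` and the 768 x 768 input are illustrative assumptions):

```python
def pup_schedule(height, width, patch=16):
    """List the feature-map sizes produced by repeated 2x upsampling."""
    assert height % patch == 0 and width % patch == 0
    h, w = height // patch, width // patch
    sizes = [(h, w)]
    while h < height:
        h, w = h * 2, w * 2       # one progressive step doubles each side
        sizes.append((h, w))
    return sizes

sizes = pup_schedule(768, 768)
print(sizes)             # [(48, 48), (96, 96), (192, 192), (384, 384), (768, 768)]
print(len(sizes) - 1)    # 4 upsampling operations
```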

Method 3: Multi-Level feature Aggregation (MLA)

Multi-level feature aggregation takes features from layers spread evenly across the encoder, $\{Z^m\}$ ($m \in \{\frac{L_e}{M}, 2\frac{L_e}{M}, \ldots, M\frac{L_e}{M}\}$), as the decoder input. The step between selected layers is $\frac{L_e}{M}$.
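As a quick check of the indexing, the selected layers can be computed directly. The helper `mla_layers` is hypothetical; the values $L_e = 24$ and $M = 4$ are chosen to match the SETR-MLA layers listed in the auxiliary loss section of this post.

```python
def mla_layers(num_layers, num_streams):
    """Indices m = k * L_e / M for k = 1..M."""
    assert num_layers % num_streams == 0
    step = num_layers // num_streams
    return [k * step for k in range(1, num_streams + 1)]

print(mla_layers(24, 4))   # [6, 12, 18, 24]
```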

Then $M$ streams are deployed, each focusing on one selected layer. Within each stream:

First, the encoder feature $Z^l$ is reshaped from 2D ($\frac{HW}{256} \times C$) to 3D ($\frac{H}{16} \times \frac{W}{16} \times C$).
Then a 3-layer network (kernel sizes 1x1, 3x3, 3x3) is applied:
- The first and third layers each halve the number of channels, and after the third layer the feature map is upsampled 4× by bilinear interpolation.
- To strengthen the information exchange between streams, the authors introduce a top-down aggregation via element-wise addition after the first layer, followed by an additional 3x3 convolution after the addition.
After the third layer, the features of all streams are fused by channel-wise concatenation, and the fused feature map is then upsampled 4× by bilinear interpolation to the original image size.
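The shapes through one stream and through the fusion step can be traced as below. This is a bookkeeping sketch only: it assumes $C$ is divisible by 4, treats the aggregation convolution as shape-preserving (so it is not listed), and leaves out the final classification layer. The function name and the example sizes are illustrative.

```python
def mla_shapes(height, width, channels, num_streams=4, patch=16):
    """Trace (stage, h, w, c) through one MLA stream and the fusion step."""
    assert channels % 4 == 0
    h, w = height // patch, width // patch
    per_stream = [
        ("reshape", h, w, channels),
        ("conv 1x1", h, w, channels // 2),         # first layer halves channels
        ("conv 3x3", h, w, channels // 2),
        ("conv 3x3", h, w, channels // 4),         # third layer halves again
        ("bilinear 4x", h * 4, w * 4, channels // 4),
    ]
    fused_channels = num_streams * (channels // 4)  # channel-wise concatenation
    fused = [
        ("concat", h * 4, w * 4, fused_channels),
        ("bilinear 4x", h * 16, w * 16, fused_channels),
    ]
    return per_stream, fused

per_stream, fused = mla_shapes(512, 512, 1024)
for stage in per_stream + fused:
    print(stage)
```

With 4 streams the concatenated feature has the same channel count as the encoder feature, and the two 4× upsampling steps together recover the factor of 16 lost to patching.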

6、Auxiliary loss:

Each auxiliary loss head is a 2-layer network. The authors add auxiliary losses at the following transformer layers:

SETR-Naive ( $Z^{10},Z^{15},Z^{20}$ )
SETR-PUP ( $Z^{10},Z^{15},Z^{20},Z^{24}$ )
SETR-MLA ( $Z^{6},Z^{12},Z^{18},Z^{24}$ )
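One common way to combine a main loss with auxiliary losses is a weighted sum, sketched below. The post does not state how the losses are weighted, so `aux_weight=0.4` and the function `total_loss` are assumptions for illustration only; the layer lists are copied from above.

```python
# Transformer layers that receive an auxiliary loss, as listed above.
AUX_LAYERS = {
    "SETR-Naive": [10, 15, 20],
    "SETR-PUP": [10, 15, 20, 24],
    "SETR-MLA": [6, 12, 18, 24],
}

def total_loss(main_loss, aux_losses, aux_weight=0.4):
    """Weighted sum of the main loss and auxiliary losses.

    aux_weight is a hypothetical value, not taken from the post.
    """
    return main_loss + aux_weight * sum(aux_losses)

# Toy numbers: one auxiliary loss value per listed layer of SETR-Naive.
aux = [0.9, 0.7, 0.6]
assert len(aux) == len(AUX_LAYERS["SETR-Naive"])
loss = total_loss(0.5, aux)
```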

Three、 Results

[Figure: results on ADE20K and Pascal VOC]

[Figure: comparison of results on Cityscapes]

[Figure: visualization of different layers]

Four、 Code

Code: https://github.com/fudan-zvg/SETR

Framework: mmsegmentation

# 1. Activate the environment
source activate mmsegmentation
# 2. Install the package in development mode
python setup.py develop

# Config files
configs/SETR/

Modify the data path in:

configs/_base_/datasets/cityscapes.py
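The fragment below shows the kind of fields that usually need editing. It is a hypothetical excerpt modeled on mmsegmentation's dataset configs, not a copy of the file in the repository, which may differ; `/path/to/cityscapes/` is a placeholder for the local dataset directory.

```python
# Hypothetical excerpt in the style of an mmsegmentation dataset config.
dataset_type = 'CityscapesDataset'
data_root = '/path/to/cityscapes/'   # placeholder: point this at the local dataset

data = dict(
    train=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/train',   # images, relative to data_root
        ann_dir='gtFine/train',        # annotations, relative to data_root
    ),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
    ),
)
```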

Copyright notice
This article was written by [Dull cat]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/210/202207290519388032.html
