[Paper reading notes] DeepLabv3+: Encoder-Decoder with Atrous Separable Convolution

2022-06-13 02:20:00 【Shameful child】

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Deep neural networks for semantic segmentation use either a spatial pyramid pooling module or an encoder-decoder structure.
- Spatial pyramid pooling module
  - Encodes multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields of view
  - Captures rich contextual information by pooling features at different resolutions
- Encoder-decoder structure
  - Captures sharper object boundaries by gradually recovering spatial information
- Comparison
  - Pyramid network: high accuracy, but the computational cost is large, so the running time is long.
  - Encoder-decoder network: low computational cost, but relatively lower accuracy.
DeepLabv3+
- Extends DeepLabv3 by adding a simple yet effective decoder module that refines the segmentation results, especially along object boundaries.
- Uses DeepLabv3 as a powerful encoder module, paired with a simple yet effective decoder module.
- To capture contextual information at multiple scales, DeepLabv3 applies several parallel atrous convolutions with different rates.
- The authors further explore the Xception model and apply depthwise separable convolution to both the Atrous Spatial Pyramid Pooling (ASPP) module and the decoder module, which yields a faster and stronger encoder-decoder network.
- DeepLabv3+ takes rich semantic information from the encoder module, while detailed object boundaries are recovered by the simple yet effective decoder module.
- By applying atrous convolution, the encoder module can extract features at an arbitrary resolution.
- Rich semantic information is encoded in the output of DeepLabv3, and atrous convolution makes it possible to control the density of the encoder features according to the computational budget.
- Atrous separable convolution
  - There are two main kinds of separable convolution: spatially separable convolution and depthwise separable convolution.
  - Spatially separable convolution has some significant limitations, so it is not widely used in deep learning.
    - It is called spatially separable because it operates on the spatial dimensions of the image and the kernel: width and height.
    - It simply splits one kernel into two smaller kernels. The most common case splits a 3x3 kernel into a 3x1 kernel and a 1x3 kernel.
    - Instead of one convolution with 9 multiplications, we perform two convolutions with 3 multiplications each (6 in total) to achieve the same effect.
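A minimal sketch of the factorization described above, in plain Python. It is an illustration rather than anything taken from the paper: the Sobel kernel, the toy image, and the helper name `conv2d_valid` are assumptions chosen for the example. It also shows the limitation mentioned above, since only a kernel that can be written as the outer product of a column and a row can be split this way.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding, stride 1) 2-D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Illustrative separable kernel: the Sobel filter is the outer product
# of a 3x1 column and a 1x3 row.
col = [[1], [2], [1]]   # 3x1 kernel
row = [[-1, 0, 1]]      # 1x3 kernel
sobel = [[c[0] * r for r in row[0]] for c in col]  # 3x3 kernel

# Small integer test image (hypothetical data).
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]

# One pass with the 3x3 kernel: 9 multiplications per output pixel.
full = conv2d_valid(image, sobel)

# Two passes, 3x1 then 1x3: 3 + 3 = 6 multiplications per output pixel.
separable = conv2d_valid(conv2d_valid(image, col), row)

print(full == separable)
```

Both paths produce the same 4x4 result, so the saving comes purely from applying two small kernels instead of one large one.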
  - Depthwise separable convolution
    - It is so named because it deals not only with the spatial dimensions but also with the depth dimension (the number of channels).
    - Depthwise separable convolution, or group convolution, is a powerful operation that reduces the computational cost and the number of parameters while maintaining similar performance.
    - It also works for kernels that cannot be "factored" into two smaller kernels.
    - It splits a kernel into two separate kernels that perform two convolutions: a depthwise convolution and a pointwise convolution.
    - Depthwise convolution
      - Each 5x5x1 kernel iterates over one channel of the image (note: one channel, not all channels), computing the scalar product for every group of 25 pixels and producing an 8x8x1 image. Stacking these images together gives an 8x8x3 image.
      - ![Depthwise convolution illustration](https://img-blog.csdnimg.cn/3cafd388caeb4ae798c0aca02a8601bb.png#pic_center)
    - Pointwise convolution
      - It is so named because it uses a 1x1 kernel, that is, a kernel that iterates over every single point. The depth of the kernel equals the number of channels of the input image.
      - Iterating a 1x1x3 kernel over the 8x8x3 image yields one 8x8x1 image.
      - ![Pointwise convolution illustration](https://img-blog.csdnimg.cn/1cc030d74b154f4a91360d36a6239916.png#pic_center)
      - Creating 256 such 1x1x3 kernels, each producing an 8x8x1 image, gives a final image of shape 8x8x256.
      - ![Pointwise convolution with 256 kernels illustration](https://img-blog.csdnimg.cn/e2cccff40253404cbfc74ca0cc02355c.png#pic_center)
    - If the original standard convolution is 12x12x3 → (5x5x3x256) → 8x8x256, the new convolution can be written as 12x12x3 → (5x5x1x3) → (1x1x3x256) → 8x8x256.
    - Source: https://zhuanlan.zhihu.com/p/197528715
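The two stages above can be sketched in plain Python to follow the shapes through the 12x12x3 → 8x8x3 → 8x8x256 example. This is a hedged illustration, not the paper's code: the function names, the averaging depthwise kernels, and the pointwise weights are all made-up assumptions, and the layout is channels-last nested lists.

```python
def depthwise_conv(image, kernels):
    """Apply one k x k kernel per input channel (valid padding, stride 1).

    image: H x W x C nested list; kernels: C kernels, each k x k.
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    k = len(kernels[0])
    out_h, out_w = height - k + 1, width - k + 1
    return [
        [
            [
                sum(image[i + a][j + b][c] * kernels[c][a][b]
                    for a in range(k) for b in range(k))
                for c in range(channels)
            ]
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def pointwise_conv(image, weights):
    """1x1 convolution: mix the channels of every pixel.

    weights: C_out rows, each holding C_in values.
    """
    return [
        [
            [sum(x * w for x, w in zip(pixel, kernel)) for kernel in weights]
            for pixel in row
        ]
        for row in image
    ]


def shape(tensor):
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Hypothetical data: a 12x12x3 image, three 5x5 averaging kernels,
# and 256 pointwise kernels of size 1x1x3.
image = [[[float(i + j + c) for c in range(3)] for j in range(12)]
         for i in range(12)]
dw_kernels = [[[1.0 / 25.0] * 5 for _ in range(5)] for _ in range(3)]
pw_weights = [[float((o + c) % 3 - 1) for c in range(3)] for o in range(256)]

features = depthwise_conv(image, dw_kernels)   # 8 x 8 x 3
output = pointwise_conv(features, pw_weights)  # 8 x 8 x 256
print(shape(image), shape(features), shape(output))
```

The depthwise stage never mixes channels, and the pointwise stage never looks at neighbouring pixels; together they cover what a single 5x5x3x256 convolution does in one step.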
- A 3×3 depthwise separable convolution factorizes a standard convolution into **(a) a depthwise convolution** (applying a single filter to each input channel) and **(b) a pointwise convolution** (combining the outputs of the depthwise convolution across channels), which greatly reduces the computational complexity.
- The depthwise convolution performs a spatial convolution independently for each input channel, while the pointwise convolution combines the outputs of the depthwise convolution.
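To put a number on "greatly reduces the computational complexity", the following sketch counts parameters and multiplications for the 12x12x3 → 8x8x256 example used above. The helper names are hypothetical and biases are ignored; it is a back-of-the-envelope calculation, not a measurement from the paper.

```python
def standard_conv_cost(out_h, out_w, k, c_in, c_out):
    """Parameters and multiplications of a standard k x k convolution."""
    params = k * k * c_in * c_out
    mults = out_h * out_w * params
    return params, mults


def depthwise_separable_cost(out_h, out_w, k, c_in, c_out):
    """Parameters and multiplications of depthwise (k x k) + pointwise (1 x 1)."""
    depthwise_params = k * k * c_in   # one k x k kernel per input channel
    pointwise_params = c_in * c_out   # c_out kernels of size 1 x 1 x c_in
    params = depthwise_params + pointwise_params
    mults = out_h * out_w * params
    return params, mults


std_params, std_mults = standard_conv_cost(8, 8, 5, 3, 256)
sep_params, sep_mults = depthwise_separable_cost(8, 8, 5, 3, 256)

print(std_params, std_mults)   # 19200 1228800
print(sep_params, sep_mults)   # 843 53952
print(round(std_mults / sep_mults, 1))  # roughly 22.8 times fewer multiplications
```

For this toy configuration the separable version needs about 23 times fewer multiplications; the exact ratio depends on the kernel size and channel counts.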
- How the method is implemented
  1. Build an encoder-decoder structure in which ASPP serves as the encoder part. The ASPP feature maps are concatenated and passed through a 1×1 convolution to obtain feature maps with a fixed number of channels, which are then upsampled by a factor of 4 to give a set of feature maps, Features1.
  2. Take a set of low-level feature maps from the encoder (at the same scale as Features1) and adjust their number of channels with a 1×1 convolution to obtain Features2. The 1×1 convolution is needed because the low-level features usually have many channels (256), which would give them too much weight and make training harder.
  3. Concatenate Features1 and Features2, extract features with 3×3 convolutions, and then upsample by a factor of 4 to obtain an output at the same scale as the original image.
The goal of semantic segmentation is to assign a semantic label to every pixel.
Encoder-decoder network
- Encoder
  - This module gradually reduces the feature maps and captures higher-level semantic information
  - DeepLabv3 as the encoder
    - DeepLabv3 uses atrous convolution to extract the features computed by a deep convolutional neural network at an arbitrary resolution.
    - DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with image-level features.
    - The last feature map before the logits in the original DeepLabv3 is used as the encoder output.
- Decoder
  - The decoder module gradually recovers the spatial information
    1. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features from the network backbone that have the same spatial resolution.
    2. Before the concatenation, a 1×1 convolution is applied to the low-level features to reduce their number of channels, because the low-level features usually contain many channels, which could outweigh the importance of the rich encoder features and make training harder.
    3. A few 3×3 convolutions refine the features, followed by another simple bilinear upsampling by a factor of 4.
  - The paper shows that output stride = 16 for the encoder gives the best trade-off between speed and accuracy.
  - With output stride = 8 the performance improves slightly, at the cost of extra computational complexity.
- The encoder module encodes multi-scale contextual information by applying atrous convolution at multiple scales, while the simple yet effective decoder module refines the segmentation results along object boundaries.
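The decoder steps above can be followed as a shape trace. The sketch below only does bookkeeping on tensor shapes; it performs no convolution. Several values are assumptions for illustration: the function name, the 256 encoder channels, low-level features at 1/4 of the input resolution, a final classifier producing 21 classes (as for PASCAL VOC), and a first upsampling factor of `output_stride / 4` so that output stride 8 also lines up with the low-level features.

```python
def decoder_shape_trace(height, width, output_stride=16,
                        low_level_channels=256, encoder_channels=256,
                        reduced_channels=48, refine_channels=256,
                        num_classes=21):
    """Return (stage name, (H, W, C)) pairs for a DeepLabv3+-style decoder."""
    low_level_stride = 4  # assumption: low-level features at 1/4 resolution
    if height % output_stride or width % output_stride:
        raise ValueError("input size must be divisible by the output stride")

    first_upsample = output_stride // low_level_stride  # 4 when output stride is 16
    h_enc, w_enc = height // output_stride, width // output_stride
    h_low, w_low = height // low_level_stride, width // low_level_stride

    return [
        ("encoder output", (h_enc, w_enc, encoder_channels)),
        ("upsampled encoder output",
         (h_enc * first_upsample, w_enc * first_upsample, encoder_channels)),
        ("low-level features", (h_low, w_low, low_level_channels)),
        ("low-level after 1x1 conv", (h_low, w_low, reduced_channels)),
        ("concatenated", (h_low, w_low, encoder_channels + reduced_channels)),
        ("after 3x3 convs", (h_low, w_low, refine_channels)),
        # assumption: a classifier maps the refined features to class scores
        ("logits", (h_low, w_low, num_classes)),
        ("upsampled output", (height, width, num_classes)),
    ]


for name, stage_shape in decoder_shape_trace(512, 512):
    print(f"{name:28s} {stage_shape}")
```

With a 512x512 input and output stride 16, the encoder output is 32x32, both branches meet at 128x128, and the concatenation has 256 + 48 = 304 channels before the 3×3 refinement.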
Atrous Convolution
- A powerful tool that generalizes the standard convolution operation: it lets us explicitly control the resolution of the features computed by a deep convolutional neural network and adjust the filter's field of view to capture multi-scale information.
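A one-dimensional sketch makes the idea concrete. For an input x, a filter w, and a rate r, the atrous convolution computes y[i] = Σ_k x[i + r·k]·w[k]; with r = 1 it reduces to the standard convolution. The function names and the toy signal below are illustrative assumptions.

```python
def atrous_conv1d(x, w, rate=1):
    """1-D atrous convolution (valid padding): y[i] = sum_k x[i + rate*k] * w[k]."""
    span = rate * (len(w) - 1)  # distance between the first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


def effective_kernel_size(kernel_size, rate):
    """Field of view covered by a kernel whose taps are `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


x = list(range(10))   # toy signal 0, 1, ..., 9
w = [1, 0, -1]        # three filter taps

print(atrous_conv1d(x, w, rate=1))   # taps at i, i+1, i+2
print(atrous_conv1d(x, w, rate=2))   # taps at i, i+2, i+4
print(effective_kernel_size(3, 2))   # 5
```

The filter keeps its three weights in both cases; only the spacing between the taps changes, so the field of view grows from 3 to 5 samples without adding parameters.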
Xception
- The Xception model is adapted to the segmentation task, and depthwise separable convolution is applied to the ASPP module and the decoder module, which yields a faster and stronger encoder-decoder network.
- The MSRA team modified the Xception model (the result is called Aligned Xception).
- This paper makes further changes on top of MSRA's modified Xception:
  1. More layers are added (the same as MSRA's modification, except that the entry flow network structure is left unchanged, for fast computation and memory efficiency).
  2. All max pooling operations are replaced by depthwise separable convolution with striding, which makes it possible to apply atrous separable convolution to extract feature maps at an arbitrary resolution (another option is to extend the atrous algorithm to max pooling operations).
  3. Extra batch normalization and ReLU activation are added after each 3×3 depthwise convolution, similar to the MobileNet design.
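Why replacing max pooling with a strided convolution helps can be seen from the output-size arithmetic alone. The sketch below uses the usual convolution output-size formula and assumes a padding of `dilation * (kernel - 1) // 2`; the function name and the 64-pixel input are hypothetical. A stride-2 convolution halves the resolution as the pooling layer did, while setting the stride to 1 keeps the resolution, and an atrous rate can then be used to preserve the field of view.

```python
def conv_output_size(size, kernel=3, stride=1, dilation=1):
    """Output length of a convolution along one spatial dimension.

    Assumes padding of dilation * (kernel - 1) // 2 on each side.
    """
    padding = dilation * (kernel - 1) // 2
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


size = 64
print(conv_output_size(size, stride=2))              # 32: downsamples like pooling
print(conv_output_size(size, stride=1))              # 64: resolution kept
print(conv_output_size(size, stride=1, dilation=2))  # 64: wider field of view
```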
The experiments use ResNet-101 pretrained on ImageNet-1k, as well as the modified Aligned Xception, to extract dense feature maps via atrous convolution. The implementation is built on TensorFlow.
The proposed model is trained end-to-end, without piecewise pretraining of each component. The proposed decoder module contains batch normalization parameters.
Simple bilinear upsampling can be regarded as a naive decoder design.
For the decoder module, the paper considers three design choices:
- The 1×1 convolution used to reduce the channels of the low-level feature map from the encoder module
  - Experiments show that reducing the channels of the low-level feature map to 48 or 32 gives better performance, so [1×1, 48] is adopted for channel reduction.
- The 3×3 convolutions used to obtain sharper segmentation results
  - Experiments show that, after concatenating the Conv2 feature map (before striding) with the DeepLabv3 feature map, using two 3×3 convolutions with 256 filters is more effective than using one or three convolutions.
- Which encoder low-level features should be used
  - A very simple yet effective decoder module is adopted: the concatenation of the DeepLabv3 feature map and the channel-reduced Conv2 feature map is refined by two [3×3, 256] operations.
ResNet-101 as Network Backbone vs. Xception as Network Backbone
- Comparative experiments were run with ResNet-101 and Xception as the backbone, and the results show that Xception performs better.
- Xception has the following changes:
  1. The number of layers is increased.
  2. All max pooling operations are replaced with 3x3 separable convolutions with stride 2.
  3. BN and ReLU are added after every 3x3 depthwise convolution.
Improvement along Object Boundaries
PASCAL VOC 2012 Test set results and best performing models
The proposed model, DeepLabv3+, uses an encoder-decoder structure in which DeepLabv3 encodes rich contextual information and a simple yet effective decoder module recovers the object boundaries.
The Xception model and atrous separable convolution make the model faster and stronger.
```
https://arxiv.org/pdf/1802.02611.pdf
```

Copyright notice
This article was written by [Shameful child]. Please include a link to the original when reposting.
https://yzsam.com/2022/02/202202280543149220.html
