[Paper reading notes] DeepLabv3+: Encoder-Decoder with Atrous Separable Convolution

2022-06-13 02:20:00 Shameful child

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

  • Deep neural networks for semantic segmentation typically use either a spatial pyramid pooling module or an encoder-decoder structure.

    • Spatial pyramid pooling module

      • Encodes multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields of view
      • Captures rich contextual information by pooling features at different resolutions
    • Encoder-decoder structure

      • Captures sharper object boundaries by gradually recovering spatial information
      • Pyramid pooling networks: high accuracy, but the computation is heavy, so the running time is long.
      • Encoder-decoder networks: less computation, but relatively lower accuracy.
  • DeepLabv3+

    • DeepLabv3+ extends DeepLabv3 by adding a simple yet effective decoder module that refines the segmentation results, especially along object boundaries.

    • DeepLabv3 serves as a powerful encoder module and is paired with a simple yet effective decoder module.

    • To capture contextual information at multiple scales, DeepLabv3 applies several parallel atrous convolutions with different rates.

    • Beyond adding the decoder, the paper further explores the Xception model and applies depthwise separable convolution to both the Atrous Spatial Pyramid Pooling (ASPP) module and the decoder module, resulting in a faster and stronger encoder-decoder network.

    • DeepLabv3+ contains rich semantic information from the encoder module, while detailed object boundaries are recovered by the simple yet effective decoder module.

    • The encoder module makes it possible to extract features at an arbitrary resolution by applying atrous convolution.

    • Rich semantic information is encoded in the output of DeepLabv3, and atrous convolution makes it possible to control the density of the encoder features according to the available computation budget.

    • Depthwise separable convolution is applied to both the ASPP module and the decoder module, resulting in a faster and stronger encoder-decoder network.

      • Atrous separable convolution

      • There are two main kinds of separable convolution: spatially separable convolution and depthwise separable convolution.

      • Spatially separable convolution has significant limitations, so it is not widely used in deep learning.

        • It is called spatially separable because it deals primarily with the spatial dimensions of the image and the kernel: width and height.

        • A spatially separable convolution simply splits a kernel into two smaller kernels. The most common case is splitting a 3x3 kernel into a 3x1 kernel and a 1x3 kernel.

          • Instead of one convolution with 9 multiplications, we perform two convolutions with 3 multiplications each (6 in total) to achieve the same effect.
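
The 3x3 into 3x1 and 1x3 split can be checked numerically. The sketch below is only an illustration: the Sobel kernel and the 5x5 test image are assumptions chosen because the kernel happens to factor exactly, and `conv2d_valid` is a hypothetical helper written for this example.

```python
# Illustrative kernel (an assumption, not taken from the notes):
# the 3x3 Sobel kernel, which factors into a 3x1 column and a 1x3 row.
col = [1, 2, 1]      # 3x1 kernel
row = [-1, 0, 1]     # 1x3 kernel
full = [[c * r for r in row] for c in col]   # their outer product, 3x3


def conv2d_valid(img, k):
    """Plain 'valid' 2-D convolution (cross-correlation, as in deep learning)."""
    kh, kw = len(k), len(k[0])
    h, w = len(img), len(img[0])
    return [[sum(img[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]


# Arbitrary 5x5 integer test image.
image = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]

direct = conv2d_valid(image, full)                      # one 3x3 pass
column_pass = conv2d_valid(image, [[c] for c in col])   # 3x1 pass
two_pass = conv2d_valid(column_pass, [row])             # then 1x3 pass

mults_direct = 3 * 3      # multiplications per placement of the 3x3 kernel
mults_two_pass = 3 + 3    # one 3x1 placement plus one 1x3 placement
print(direct == two_pass, mults_direct, mults_two_pass)
```

Most learned kernels are not an exact outer product of two vectors, which is one reason this factorization is of limited use in practice.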
      
      • Depthwise Separable Convolutions

        • Depthwise separable convolution is so named because it deals not only with the spatial dimensions but also with the depth dimension (the number of channels).
        • Depthwise separable convolution, or group convolution, is a powerful operation that reduces the computation cost and the number of parameters while maintaining similar performance.
        • Depthwise separable convolution also works with kernels that cannot be "factored" into two smaller kernels.
        • A depthwise separable convolution splits a kernel into two separate kernels that perform two convolutions: a depthwise convolution and a pointwise convolution.
        • Depthwise Convolution
        -  With a 12x12x3 input image, each 5x5x1 kernel iterates over 1 channel of the image (note: 1 channel, not all channels), taking the scalar product of every group of 25 pixels, and so produces an 8x8x1 image. Stacking these images together creates an 8x8x3 image.
        - ![ Insert picture description here ](https://img-blog.csdnimg.cn/3cafd388caeb4ae798c0aca02a8601bb.png#pic_center)
      
      - Pointwise Convolution
        -  Pointwise convolution is so named because it uses a 1x1 kernel, that is, a kernel that iterates over every single point. The depth of the kernel equals the number of channels of the input image.
        -  Iterating a 1x1x3 kernel over the 8x8x3 image produces one 8x8x1 image.
        - ![ Insert picture description here ](https://img-blog.csdnimg.cn/1cc030d74b154f4a91360d36a6239916.png#pic_center)
      
        -  Creating 256 such 1x1x3 kernels, each of which outputs an 8x8x1 image, yields a final image of shape 8x8x256.
        - ![ Insert picture description here ](https://img-blog.csdnimg.cn/e2cccff40253404cbfc74ca0cc02355c.png#pic_center)
      
      • If the original standard convolution is 12x12x3 -(5x5x3x256)→ 8x8x256, the new convolution can be written as 12x12x3 -(5x5x1x3)→ 8x8x3 -(1x1x3x256)→ 8x8x256.
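
To see how much work the split saves, the multiplications for the 12x12x3 to 8x8x256 example can be counted directly. This is a back-of-the-envelope sketch; the function names are made up for this illustration, and additions and bias terms are ignored.

```python
def standard_conv_mults(out_h, out_w, k, c_in, c_out):
    """Multiplications for a standard k x k convolution."""
    return out_h * out_w * k * k * c_in * c_out


def depthwise_separable_mults(out_h, out_w, k, c_in, c_out):
    """Multiplications for a depthwise k x k pass followed by a 1x1 pointwise pass."""
    depthwise = out_h * out_w * k * k * c_in   # one k x k kernel per input channel
    pointwise = out_h * out_w * c_in * c_out   # c_out kernels of shape 1x1xc_in
    return depthwise + pointwise


# Shapes from the example: 12x12x3 input, 5x5 kernels, 8x8x256 output.
std = standard_conv_mults(8, 8, 5, 3, 256)
sep = depthwise_separable_mults(8, 8, 5, 3, 256)
print(std, sep, round(std / sep, 1))
```

For these shapes the standard convolution needs 1,228,800 multiplications and the separable version 53,952, roughly 23 times fewer.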

      • https://zhuanlan.zhihu.com/p/197528715
        
      • A 3×3 depthwise separable convolution decomposes a standard convolution into **(a) a depthwise convolution** (applying a single filter to each input channel) and **(b) a pointwise convolution** (combining the outputs of the depthwise convolution across channels), which greatly reduces the computational complexity.

      • The depthwise convolution performs a spatial convolution independently for each input channel, and the pointwise convolution combines the outputs of the depthwise convolution.
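
Putting the two steps together with an atrous rate gives an atrous separable convolution. The following is a minimal pure-Python sketch under simplifying assumptions ('valid' padding, stride 1, no bias, nested lists instead of tensors); the function names are hypothetical and this is not the paper's TensorFlow implementation.

```python
def depthwise_conv(x, kernels, rate=1):
    """x: list of C channels, each an HxW list of lists.
    kernels: one k x k kernel per channel.
    rate: atrous rate (1 means an ordinary convolution)."""
    out = []
    for chan, k in zip(x, kernels):
        ks = len(k)
        span = (ks - 1) * rate   # extent of the dilated kernel minus one
        h, w = len(chan), len(chan[0])
        out.append([[sum(chan[i + a * rate][j + b * rate] * k[a][b]
                         for a in range(ks) for b in range(ks))
                     for j in range(w - span)]
                    for i in range(h - span)])
    return out


def pointwise_conv(x, weights):
    """weights: C_out rows of C_in scalars, i.e. C_out kernels of shape 1x1xC_in."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wt[c] * x[c][i][j] for c in range(len(x)))
              for j in range(w)]
             for i in range(h)]
            for wt in weights]


def atrous_separable_conv(x, dw_kernels, pw_weights, rate=1):
    """Depthwise (optionally dilated) pass, then 1x1 pointwise pass."""
    return pointwise_conv(depthwise_conv(x, dw_kernels, rate), pw_weights)


# Toy input: 3 channels of 7x7 ones, 3x3 all-ones depthwise kernels, rate 2.
x = [[[1] * 7 for _ in range(7)] for _ in range(3)]
dw_kernels = [[[1] * 3 for _ in range(3)] for _ in range(3)]
pw_weights = [[1, 1, 1], [1, 0, -1]]   # 2 output channels
y = atrous_separable_conv(x, dw_kernels, pw_weights, rate=2)
print(len(y), len(y[0]), len(y[0][0]))
```

With rate 2 the 3x3 kernel covers a 5x5 window, so the 7x7 input shrinks to 3x3 while each output still uses only 9 multiplications per channel.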

    • How the method in the paper is implemented

      1. First build an overall encoder-decoder structure in which the encoder part is provided by ASPP. The feature maps from this part are concatenated and passed through a 1×1 convolution to obtain feature maps with a fixed number of channels, then upsampled by a factor of 4 to obtain a set of feature maps, Features1.
      2. Take a set of low-level feature maps from the encoder (at the same scale as Features1) and adjust their number of channels with a 1×1 convolution to obtain a set of feature maps, Features2. The 1×1 convolution is used because low-level features usually have many channels (256), which would give them too much weight and make training harder.
      3. Concatenate Features1 and Features2, extract features with 3×3 convolutions, then upsample by a factor of 4 to obtain an output at the same scale as the original image.
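
The three steps above can be traced as plain shape arithmetic. The sketch below makes several assumptions that are not fixed by these steps: a 512x512 input, output stride 16, low-level features taken at stride 4, 256 encoder channels, and 48 channels after the 1×1 reduction. `decoder_shapes` is a hypothetical helper, not part of any library.

```python
def decoder_shapes(input_size=512, output_stride=16, low_level_stride=4,
                   encoder_channels=256, reduced_channels=48,
                   refine_channels=256):
    """Trace (height, width, channels) through the decoder steps."""
    enc = input_size // output_stride
    features1 = (enc * 4, enc * 4, encoder_channels)    # step 1: upsample x4
    low = input_size // low_level_stride
    features2 = (low, low, reduced_channels)            # step 2: 1x1 conv
    if features1[:2] != features2[:2]:
        raise ValueError("Features1 and Features2 must share a spatial size")
    concat = (low, low, encoder_channels + reduced_channels)   # step 3: concat
    refined = (low, low, refine_channels)               # step 3: 3x3 convs
    output = (low * 4, low * 4)                         # step 3: upsample x4
    return {"features1": features1, "features2": features2,
            "concat": concat, "refined": refined, "output": output}


shapes = decoder_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions the concatenated tensor has 256 + 48 = 304 channels, so the reduced low-level features make up only a small share of the input to the 3×3 convolutions.
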
  • The goal of semantic segmentation is to assign a semantic label to every pixel.

  • Encoder-decoder network

    • Encoder

      • This module gradually reduces the feature maps and captures higher-level semantic information
      • DeepLabv3 as the encoder
        • DeepLabv3 uses atrous convolution to extract the features computed by deep convolutional neural networks at an arbitrary resolution.
        • DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with image-level features.
        • The last feature map before the logits in the original DeepLabv3 is used as the encoder output.
    • Decoder

      • The decoder module gradually recovers the spatial information

        1. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features from the network backbone that have the same spatial resolution.
        2. Before the concatenation, a 1×1 convolution is applied to the low-level features to reduce the number of channels, because the low-level features usually contain a large number of channels, which may outweigh the rich encoder features and make training harder.
        3. After the concatenation, a few 3×3 convolutions refine the features, followed by another simple bilinear upsampling by a factor of 4.
          • The paper shows that output stride = 16 for the encoder module gives the best trade-off between speed and accuracy.
          • Using output stride = 8 for the encoder module improves performance slightly, at the cost of extra computational complexity.
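
One way to picture the output stride setting: once the running stride reaches the target, later stages keep stride 1 and use a larger atrous rate instead. The sketch below assumes a hypothetical backbone with five stride-2 stages and a 512x512 input; the function name and stage layout are illustrative, not the paper's exact network.

```python
def plan_strides_and_rates(stage_strides, output_stride):
    """Turn nominal per-stage strides into (stride, rate) pairs.

    Once the accumulated stride reaches output_stride, a stage keeps
    stride 1 and later stages are dilated by the stride that was dropped.
    """
    current_stride = 1
    rate = 1
    plan = []
    for s in stage_strides:
        if current_stride >= output_stride:
            plan.append((1, rate))   # stride removed at this stage
            rate *= s                # later stages use a larger rate
        else:
            plan.append((s, 1))
            current_stride *= s
    return plan, current_stride


stages = [2, 2, 2, 2, 2]   # hypothetical backbone, nominal stride 32
plan16, os16 = plan_strides_and_rates(stages, 16)
plan8, os8 = plan_strides_and_rates(stages, 8)

input_size = 512           # assumed input resolution
positions16 = (input_size // os16) ** 2   # encoder feature positions
positions8 = (input_size // os8) ** 2
print(plan16, positions16)
print(plan8, positions8)
```

At output stride 8 the encoder features have four times as many spatial positions as at output stride 16, which gives a rough sense of where the extra computation comes from.
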
    • The encoder module encodes multi-scale contextual information by applying atrous convolution at multiple scales, while the simple yet effective decoder module refines the segmentation results along object boundaries.

  • Atrous Convolution

    • Atrous convolution is a powerful tool that lets us explicitly control the resolution of the features computed by deep convolutional neural networks and adjust the filter's field of view to capture multi-scale information. It generalizes the standard convolution operation.
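
A one-dimensional sketch shows how the rate widens the field of view without adding weights. The helper names are made up for this example, 'valid' padding is assumed, and the rates 6, 12 and 18 at the end are illustrative values not stated in these notes.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D atrous convolution: kernel taps are spaced `rate` samples apart.
    rate=1 reduces to the standard convolution (cross-correlation form)."""
    span = (len(kernel) - 1) * rate
    return [sum(signal[i + k * rate] * w for k, w in enumerate(kernel))
            for i in range(len(signal) - span)]


def effective_kernel_size(k, rate):
    """Field of view of a size-k kernel with (rate - 1) gaps between taps."""
    return k + (k - 1) * (rate - 1)


signal = list(range(10))
kernel = [1, 0, -1]

print(atrous_conv1d(signal, kernel, rate=1))   # standard convolution
print(atrous_conv1d(signal, kernel, rate=3))   # same 3 weights, wider view
print([effective_kernel_size(3, r) for r in (1, 6, 12, 18)])
```

The number of weights stays at 3 in every case; only the spacing between the taps changes.
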
  • Xception

    • The Xception model is adapted to the segmentation task, and depthwise separable convolution is applied to both the ASPP module and the decoder module, resulting in a faster and stronger encoder-decoder network.

    • The MSRA team modified the Xception model (the result is called Aligned Xception).

    • This paper makes further changes to the MSRA team's modified Xception:

      • The entry flow network structure is left unmodified, for fast computation and memory efficiency
      • All max pooling operations are replaced by depthwise separable convolutions with striding, which makes it possible to apply atrous separable convolution to extract feature maps at an arbitrary resolution (another option is to extend the atrous algorithm to max pooling operations).
      • Extra batch normalization and ReLU activation are added after each 3×3 depthwise convolution, similar to the MobileNet design
      1. More layers (the same as the MSRA modification, except for the entry flow changes)
      2. All max pooling operations are replaced by depthwise separable convolutions with stride
      3. Extra batch normalization and ReLU are added after each 3×3 depthwise convolution, similar to MobileNet.
  • The experiments use an ImageNet-1k pretrained ResNet-101 or the modified Aligned Xception to extract dense feature maps by atrous convolution. The implementation is built on TensorFlow.

  • The proposed model is trained end to end, without piecewise pretraining of each component. The proposed decoder module includes batch normalization parameters.

  • Simple bilinear upsampling can be regarded as a naive decoder design.

  • For the decoder module, the paper examines three design choices

    • The 1×1 convolution used to reduce the channels of the low-level feature map from the encoder module

      • Experiments show that reducing the channels of the low-level feature map from the encoder module to 48 or 32 gives better performance, so [1×1, 48] is adopted for channel reduction.
    • The 3×3 convolutions used to obtain sharper segmentation results

      • Experiments show that after concatenating the Conv2 feature map (before striding) with the DeepLabv3 feature map, using two 3×3 convolutions with 256 filters is more effective than using just one or three convolutions.
    • Which low-level encoder features should be used

      • A very simple yet effective decoder module is adopted: the concatenation of the DeepLabv3 feature map and the channel-reduced Conv2 feature map is refined by two [3×3, 256] operations.
  • ResNet-101 as Network Backbone VS Xception as Network Backbone

    • Comparative experiments with ResNet-101 and Xception as the network backbone show that Xception gives better results.

      • Xception is modified as follows:
        1. More layers
        2. All max pooling operations are replaced by 3x3 separable convolutions with stride 2
        3. BN and ReLU are added after every 3x3 depthwise convolution
  • Improvement along Object Boundaries

  • PASCAL VOC 2012 Test set results and best performing models

  • The proposed model, DeepLabv3+, uses an encoder-decoder structure in which DeepLabv3 encodes rich contextual information and a simple yet effective decoder module recovers the object boundaries.

  • The Xception model and atrous separable convolution make the model faster and stronger.

  • https://arxiv.org/pdf/1802.02611.pdf
    

Original site

Copyright notice
This article was written by [Shameful child]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202280543149220.html