BiSeNet v2

2022-07-29 08:07:00 【00000cj】

Paper: BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation

The Detail Path and Semantic Path in v2 correspond to the Spatial Path and Context Path in v1, respectively.

Compared with v1, there are two main improvements:

The time-consuming cross-layer connections are removed, which simplifies the model structure.
The overall architecture is redesigned. Specifically: (1) the Detail Path is deepened to encode more detail; (2) lightweight components based on depthwise separable convolution are designed for the Semantic Path; (3) an effective aggregation layer is proposed to strengthen the connection between the two paths.

Bilateral Segmentation Network

The overall structure is shown in the figure below

The specific structure of detail branch and semantic branch is shown in the following table

Detail Branch

The detail branch is responsible for extracting spatial detail, i.e. low-level information. It therefore needs rich channel capacity (a large number of channels) to encode that detail. Because the branch focuses on low-level information, it should also be a shallow structure with small strides. In short, the detail branch uses wide channels and shallow layers. Residual connections are also best avoided here, since the additional memory access cost reduces speed.

As shown in table (1), the detail branch contains 3 stages. Each stage is a stack of convolution layers (2 in the first stage and 3 in each of the other two, according to the printout below), and every convolution is followed by a BN and a ReLU. The first convolution of each stage has stride=2, so the output feature map of this branch is 1/8 the size of the model input.

The specific structure of the detail branch is as follows

DetailBranch(
  (detail_branch): ModuleList(
    (0): Sequential(
      (0): ConvModule(
        (conv): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
      (1): ConvModule(
        (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
    )
    (1): Sequential(
      (0): ConvModule(
        (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
      (1): ConvModule(
        (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
      (2): ConvModule(
        (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
    )
    (2): Sequential(
      (0): ConvModule(
        (conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
      (1): ConvModule(
        (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
      (2): ConvModule(
        (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activate): ReLU(inplace=True)
      )
    )
  )
)
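The 1/8 output size can be checked with the standard convolution output-size formula. The following is a minimal sketch in plain Python; the helper names `conv_out_size` and `detail_branch_size` are made up for illustration, and the strides are read from the printout above.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Output length along one spatial axis of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# Strides of the 3x3 convolutions in each stage, read from the printout above.
DETAIL_STAGE_STRIDES = [
    [2, 1],     # stage 0: 3 -> 64 channels
    [2, 1, 1],  # stage 1: 64 -> 64 channels
    [2, 1, 1],  # stage 2: 64 -> 128 channels
]


def detail_branch_size(input_size):
    """Spatial size after all three stages of the detail branch."""
    size = input_size
    for stage in DETAIL_STAGE_STRIDES:
        for stride in stage:
            size = conv_out_size(size, kernel=3, stride=stride, padding=1)
    return size


print(detail_branch_size(480))  # 60, i.e. 480 / 8
```

Only the three stride-2 convolutions change the resolution; the stride-1 convolutions with padding 1 keep it unchanged.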

Semantic Branch

To obtain a large receptive field at a low computational cost, the authors borrow from lightweight networks such as Xception, MobileNet and ShuffleNet when designing the semantic branch. In contrast to the detail branch, which is shallow with many channels, the semantic branch is a deep structure with few channels, as follows.

Stem Block

The authors adopt the Stem Block as the first stage of the semantic branch, as shown in figure (a). It uses two different downsampling methods to shrink the feature representation and then concatenates the outputs of the two paths. This structure is computationally efficient and has effective feature expression ability.

The specific structure of the Stem Block is as follows

(stage1): StemBlock(
        (conv_first): ConvModule(
          (conv): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activate): ReLU(inplace=True)
        )
        (convs): Sequential(
          (0): ConvModule(
            (conv): Conv2d(16, 8, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (activate): ReLU(inplace=True)
          )
          (1): ConvModule(
            (conv): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (activate): ReLU(inplace=True)
          )
        )
        (pool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
        (fuse_last): ConvModule(
          (conv): Conv2d(32, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activate): ReLU(inplace=True)
        )
      )
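To make the two downsampling paths concrete, here is a small sketch that traces channel counts and spatial sizes through the Stem Block, using the kernel sizes, strides and channels from the printout above. It is plain Python shape arithmetic rather than the actual module, and the names `out_size` and `stem_block_shapes` are hypothetical.

```python
def out_size(size, kernel, stride, padding):
    """Output length along one spatial axis for a conv or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_block_shapes(height, width):
    """Trace (channels, height, width) through the Stem Block printed above."""
    shapes = {}
    # conv_first: 3x3 conv, stride 2, 3 -> 16 channels
    h, w = out_size(height, 3, 2, 1), out_size(width, 3, 2, 1)
    shapes["conv_first"] = (16, h, w)
    # Path 1 (convs): 1x1 conv 16 -> 8, then 3x3 stride-2 conv 8 -> 16
    ch, cw = out_size(h, 1, 1, 0), out_size(w, 1, 1, 0)
    ch, cw = out_size(ch, 3, 2, 1), out_size(cw, 3, 2, 1)
    shapes["convs"] = (16, ch, cw)
    # Path 2 (pool): 3x3 max pooling, stride 2, keeps the 16 channels
    ph, pw = out_size(h, 3, 2, 1), out_size(w, 3, 2, 1)
    shapes["pool"] = (16, ph, pw)
    # Both paths must have the same spatial size to be concatenated
    assert (ch, cw) == (ph, pw)
    shapes["concat"] = (16 + 16, ch, cw)
    # fuse_last: 3x3 conv, stride 1, 32 -> 16 channels
    shapes["fuse_last"] = (16, out_size(ch, 3, 1, 1), out_size(cw, 3, 1, 1))
    return shapes


for name, shape in stem_block_shapes(480, 480).items():
    print(name, shape)
```

Concatenating the two 16-channel paths explains why `fuse_last` takes 32 input channels.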

Gather-and-Expansion Layer

Apart from the stem block at the start and the context embedding block at the end, every stage in between in the semantic branch is composed of GE layers, as shown in the figure below

A GE layer consists of (1) a 3x3 convolution that efficiently aggregates feature responses and expands them to a higher-dimensional space; (2) a 3x3 depthwise convolution that extracts features independently on each channel; (3) a 1x1 convolution that projects the output of the depthwise convolution back to a low-channel space.

When stride=2, two 3x3 depthwise convolutions are used to further enlarge the receptive field, and a depthwise separable convolution serves as the shortcut.

The structure of stage S3 of the semantic branch is as follows (it is labelled stage2 in the printout). It contains 2 GE layers: the first has stride=2 and the second has stride=1

(stage2): Sequential(
        (0): GELayer(
          (conv1): ConvModule(
            (conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (activate): ReLU(inplace=True)
          )
          (dwconv): Sequential(
            (0): ConvModule(
              (conv): Conv2d(16, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=16, bias=False)
              (bn): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (1): ConvModule(
              (conv): Conv2d(96, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=96, bias=False)
              (bn): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activate): ReLU(inplace=True)
            )
          )
          (shortcut): Sequential(
            (0): DepthwiseSeparableConvModule(
              (depthwise_conv): ConvModule(
                (conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=16, bias=False)
                (bn): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (pointwise_conv): ConvModule(
                (conv): Conv2d(16, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
            )
          )
          (conv2): Sequential(
            (0): ConvModule(
              (conv): Conv2d(96, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
          )
          (act): ReLU()
        )
        (1): GELayer(
          (conv1): ConvModule(
            (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (activate): ReLU(inplace=True)
          )
          (dwconv): Sequential(
            (0): ConvModule(
              (conv): Conv2d(32, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
              (bn): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (activate): ReLU(inplace=True)
            )
          )
          (conv2): Sequential(
            (0): ConvModule(
              (conv): Conv2d(192, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
          )
          (act): ReLU()
        )
      )
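As a rough illustration of why the GE layer stays lightweight, the sketch below counts the convolution weights of the main path of the first GE layer in the printout above. The helper `conv_params` is a hypothetical name; the counts cover bias-free convolution weights only and ignore BatchNorm parameters and the shortcut.

```python
def conv_params(in_ch, out_ch, kernel, groups=1):
    """Number of weights in a bias-free 2D convolution."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return out_ch * (in_ch // groups) * kernel * kernel


# Main path of the first GE layer (stride 2), as listed in the printout above.
ge_main_path = {
    "conv1": conv_params(16, 16, 3),                # dense 3x3
    "dwconv.0": conv_params(16, 96, 3, groups=16),  # grouped 3x3, expands 16 -> 96
    "dwconv.1": conv_params(96, 96, 3, groups=96),  # depthwise 3x3
    "conv2": conv_params(96, 32, 1),                # 1x1 projection
}

expansion_ratio = 96 // 16
dense_equivalent = conv_params(16, 96, 3)  # same shape without grouping

print(ge_main_path)
print("expansion ratio:", expansion_ratio)
print("dense 16->96 3x3 conv:", dense_equivalent)
```

With `groups=16`, the expanding convolution needs 864 weights, against 13824 for a dense 3x3 convolution with the same input and output channels.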

Context Embedding Block

In the last stage of the semantic branch, the authors use a CE layer as the final layer instead of a GE layer. Its structure is shown in figure (4)(b). It uses global average pooling and a residual connection to encode global context information efficiently.

(stage4_CEBlock): CEBlock(
        (gap): Sequential(
          (0): AdaptiveAvgPool2d(output_size=(1, 1))
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv_gap): ConvModule(
          (conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activate): ReLU(inplace=True)
        )
        (conv_last): ConvModule(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activate): ReLU(inplace=True)
        )
      )
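The core idea of the CE block, adding a per-channel global average back onto every position, can be sketched in plain Python as follows. This is a simplified illustration on nested lists: it leaves out the BatchNorm, the 1x1 convolution and the final 3x3 convolution shown in the printout above, and the function names are hypothetical.

```python
def global_average_pool(feature):
    """feature: list of channels, each an H x W nested list. One mean per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def add_global_context(feature):
    """Residual add: broadcast each channel's global mean onto all positions."""
    means = global_average_pool(feature)
    return [
        [[value + mean for value in row] for row in channel]
        for channel, mean in zip(feature, means)
    ]


feature = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
    [[0.0, 0.0], [0.0, 8.0]],  # channel 1, mean 2.0
]
print(add_global_context(feature))
```

Because the pooled tensor has spatial size 1x1, the residual addition broadcasts the same context value to every pixel of the channel.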

Bilateral Guided Aggregation

The detail and semantic branches produce different kinds of features: the detail branch extracts low-level detail features, while the semantic branch extracts high-level semantic features. The features of the two branches therefore cannot simply be fused by summation or concatenation. The authors propose the bilateral guided aggregation layer to fuse the complementary information from the two branches, using the contextual information of the semantic branch to guide the feature response of the detail branch. Guidance at different scales yields feature representations at different scales, which effectively encodes multi-scale information. The specific structure is shown in the following figure

BGA Code

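# Imports as used in MMSegmentation 0.x; module paths may differ in other versions.
import torch
import torch.nn as nn
from mmcv.cnn import ConvModule, DepthwiseSeparableConvModule
from mmcv.runner import BaseModule
from mmseg.ops import resize


class BGALayer(BaseModule):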
    """Bilateral Guided Aggregation Layer to fuse the complementary information
    from both Detail Branch and Semantic Branch.

    Args:
        out_channels (int): Number of output channels.
            Default: 128.
        align_corners (bool): align_corners argument of F.interpolate.
            Default: False.
        conv_cfg (dict | None): Config of conv layers.
            Default: None.
        norm_cfg (dict | None): Config of norm layers.
            Default: dict(type='BN').
        act_cfg (dict): Config of activation layers.
            Default: dict(type='ReLU').
        init_cfg (dict or list[dict], optional): Initialization config dict.
            Default: None.
    Returns:
        output (torch.Tensor): Output feature map for Segment heads.
    """

    def __init__(self,
                 out_channels=128,
                 align_corners=False,
                 conv_cfg=None,
                 norm_cfg=dict(type='BN'),
                 act_cfg=dict(type='ReLU'),
                 init_cfg=None):
        super(BGALayer, self).__init__(init_cfg=init_cfg)
        self.out_channels = out_channels
        self.align_corners = align_corners
        self.detail_dwconv = nn.Sequential(
            DepthwiseSeparableConvModule(
                in_channels=self.out_channels,
                out_channels=self.out_channels,
                kernel_size=3,
                stride=1,
                padding=1,
                dw_norm_cfg=norm_cfg,
                dw_act_cfg=None,
                pw_norm_cfg=None,
                pw_act_cfg=None,
            ))
        self.detail_down = nn.Sequential(
            ConvModule(
                in_channels=self.out_channels,
                out_channels=self.out_channels,
                kernel_size=3,
                stride=2,
                padding=1,
                bias=False,
                conv_cfg=conv_cfg,
                norm_cfg=norm_cfg,
                act_cfg=None),
            nn.AvgPool2d(kernel_size=3, stride=2, padding=1, ceil_mode=False))
        self.semantic_conv = nn.Sequential(
            ConvModule(
                in_channels=self.out_channels,
                out_channels=self.out_channels,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=False,
                conv_cfg=conv_cfg,
                norm_cfg=norm_cfg,
                act_cfg=None))
        self.semantic_dwconv = nn.Sequential(
            DepthwiseSeparableConvModule(
                in_channels=self.out_channels,
                out_channels=self.out_channels,
                kernel_size=3,
                stride=1,
                padding=1,
                dw_norm_cfg=norm_cfg,
                dw_act_cfg=None,
                pw_norm_cfg=None,
                pw_act_cfg=None,
            ))
        self.conv = ConvModule(
            in_channels=self.out_channels,
            out_channels=self.out_channels,
            kernel_size=3,
            stride=1,
            padding=1,
            inplace=True,
            conv_cfg=conv_cfg,
            norm_cfg=norm_cfg,
            act_cfg=act_cfg,
        )

    def forward(self, x_d, x_s):  # (4,128,60,60),(4,128,15,15)
        detail_dwconv = self.detail_dwconv(x_d)  # (4,128,60,60)
        detail_down = self.detail_down(x_d)  # (4,128,15,15)
        semantic_conv = self.semantic_conv(x_s)  # (4,128,15,15)
        semantic_dwconv = self.semantic_dwconv(x_s)  # (4,128,15,15)
        semantic_conv = resize(
            input=semantic_conv,
            size=detail_dwconv.shape[2:],
            mode='bilinear',
            align_corners=self.align_corners)  # (4,128,60,60)
        fuse_1 = detail_dwconv * torch.sigmoid(semantic_conv)  # (4,128,60,60)
        fuse_2 = detail_down * torch.sigmoid(semantic_dwconv)  # (4,128,15,15)
        fuse_2 = resize(
            input=fuse_2,
            size=fuse_1.shape[2:],
            mode='bilinear',
            align_corners=self.align_corners)  # (4,128,60,60)
        output = self.conv(fuse_1 + fuse_2)  # (4,128,60,60)
        return output
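The gating logic in `forward` can be illustrated on tiny single-channel grids. The sketch below is plain Python and makes two simplifying assumptions: the convolution outputs are passed in directly as inputs, and nearest-neighbour upsampling stands in for the bilinear `resize`. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2D list (stand-in for bilinear resize)."""
    return [
        [grid[i // factor][j // factor] for j in range(len(grid[0]) * factor)]
        for i in range(len(grid) * factor)
    ]


def gate(features, guidance):
    """Element-wise product of features with sigmoid(guidance)."""
    return [
        [f * sigmoid(g) for f, g in zip(f_row, g_row)]
        for f_row, g_row in zip(features, guidance)
    ]


def bga_fuse(detail_hi, detail_lo, semantic_a, semantic_b, factor):
    """Mirror fuse_1 + fuse_2 from the forward pass, before the final conv."""
    # fuse_1: high-resolution detail gated by upsampled semantic guidance
    fuse_1 = gate(detail_hi, upsample_nearest(semantic_a, factor))
    # fuse_2: downsampled detail gated at low resolution, then upsampled
    fuse_2 = upsample_nearest(gate(detail_lo, semantic_b), factor)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(fuse_1, fuse_2)]


detail_hi = [[2.0, 2.0], [2.0, 2.0]]  # plays the role of detail_dwconv
detail_lo = [[4.0]]                   # plays the role of detail_down
semantic_a = [[0.0]]                  # plays the role of semantic_conv
semantic_b = [[0.0]]                  # plays the role of semantic_dwconv
print(bga_fuse(detail_hi, detail_lo, semantic_a, semantic_b, factor=2))
```

A guidance value of 0 gives a sigmoid of 0.5, so each detail value is halved before the two fused maps are summed.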

Booster Training Strategy

To further improve segmentation accuracy, the authors propose a booster training strategy. It enhances the feature representation during training and can be discarded entirely at inference, so it adds no inference cost. As shown in figure (3), auxiliary segmentation heads are attached at different positions of the semantic branch to provide additional supervision on the intermediate outputs of the model, which improves accuracy.
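One way to picture this strategy is as a training loss that adds the auxiliary head losses to the main head loss, while inference keeps only the main head. The sketch below is an assumption-laden illustration: the function names, the single shared `aux_weight` and its default value are hypothetical and are not taken from the paper or from MMSegmentation.

```python
def booster_loss(main_loss, aux_losses, aux_weight=1.0):
    """Main head loss plus the weighted sum of auxiliary head losses."""
    return main_loss + aux_weight * sum(aux_losses)


def heads_in_use(training, num_aux_heads=4):
    """Auxiliary heads are only needed during training."""
    heads = ["main"]
    if training:
        heads += [f"aux_{i}" for i in range(num_aux_heads)]
    return heads


print(booster_loss(1.0, [0.5, 0.25, 0.25, 0.5]))
print(heads_in_use(training=True))
print(heads_in_use(training=False))
```

Since the auxiliary heads only contribute loss terms, dropping them at inference leaves the main prediction path unchanged.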

Implementation process

Taking the BiSeNet v2 implementation in MMSegmentation as an example, let's walk through the implementation.

Assume batch_size=4 and an input shape of (4, 3, 480, 480).

The output of the Detail Branch is (4, 128, 60, 60).
For the Semantic Branch, as shown in table (1), the output of the Stem Block is (4, 16, 120, 120), the output of S3 is (4, 32, 60, 60), the output of S4 is (4, 64, 30, 30), and S5 produces both the output of its second GE layer, (4, 128, 15, 15), and the output of the final CE layer, (4, 128, 15, 15). The semantic branch therefore returns a list of 5 outputs. The final CE output and the detail branch output are fed into the BGA layer. The first 4 outputs are used only during training, as inputs to the auxiliary segmentation heads.
The output of Bilateral Guided Aggregation is (4, 128, 60, 60).
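The semantic branch shapes in this walk-through follow directly from each output's channel count and total stride. Here is a short sketch that reproduces them, assuming the input height and width are divisible by 32; the names `SEMANTIC_OUTPUTS` and `semantic_output_shapes` are made up for illustration.

```python
# (name, channels, total stride) of each semantic branch output in the walk-through
SEMANTIC_OUTPUTS = [
    ("stem", 16, 4),
    ("S3", 32, 8),
    ("S4", 64, 16),
    ("S5_GE", 128, 32),
    ("S5_CE", 128, 32),
]


def semantic_output_shapes(batch, height, width):
    """Shapes of the 5 semantic branch outputs for a given input size."""
    return [
        (name, (batch, channels, height // stride, width // stride))
        for name, channels, stride in SEMANTIC_OUTPUTS
    ]


shapes = semantic_output_shapes(4, 480, 480)
to_bga = shapes[-1]         # CE output, fused with the detail branch output
to_aux_heads = shapes[:-1]  # first four outputs, used only during training

for name, shape in shapes:
    print(name, shape)
```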