ConvNeXt: A ConvNet for the 2020s - Model Brief
2022-07-27 22:56:00 【gongyuandaye】
1. Abstract

The paper benchmarks itself against last year's best paper, Swin Transformer: at the same FLOPs it achieves higher accuracy and faster inference. It borrows Swin's design patterns and training tricks (such as the AdamW optimizer) across the board, and folds Swin's strategies step by step into the ResNet design. The figure below shows clearly how each incremental change affects accuracy:
2. Model design
The design ideas behind each step of the roadmap above are briefly explained below.
2.1 stage ratio
VGG introduced the idea of dividing the backbone network into several stages, each of which downsamples the feature map to a different size; a model tends to perform better when its deeper stages contain more blocks. ResNet-50 has 4 stages with block counts of (3, 4, 6, 3), while Swin-T uses (2, 2, 6, 2), a ratio of 1:1:3:1. The author applies this ratio to ResNet, so the number of blocks stacked in each stage becomes (3, 3, 9, 3).
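For reference, a tiny sketch of the block counts quoted above (variable names are only for illustration); it just makes the shared 1:1:3:1 ratio explicit:

resnet50_depths   = (3, 4, 6, 3)
swin_t_depths     = (2, 2, 6, 2)
convnext_t_depths = (3, 3, 9, 3)

# Swin-T and ConvNeXt-T share the same stage ratio:
print([d // min(convnext_t_depths) for d in convnext_t_depths])  # [1, 1, 3, 1]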
2.2 patchify stem
In Swin, the 224x224 input image is first downsampled 4x with a k4s4 (4x4 kernel, stride 4) convolution. The author likewise replaces the stem of ResNet with this patchify layer:
stem = nn.Sequential(
    nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
    LayerNorm(dims[0], eps=1e-6, data_format="channels_first")
)
Here dims[0] = 96, the same output dimension as Swin (see 2.4).
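As a quick sanity check, a minimal sketch (assuming a standard 3x224x224 input and dims[0] = 96; the LayerNorm part is left out here because it is only defined in 2.7) shows the stem reducing the spatial size by 4x in a single step:

import torch
import torch.nn as nn

in_chans, dims = 3, [96, 192, 384, 768]  # assumed Tiny-variant widths
stem_conv = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)

x = torch.randn(1, in_chans, 224, 224)
print(stem_conv(x).shape)  # torch.Size([1, 96, 56, 56]), i.e. 4x spatial downsampling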
2.3 Depthwise convolution (ResNeXt-ify)
ResNeXt uses grouped convolution: the channels are divided into groups and each group is convolved separately, which gives a better FLOPs/accuracy trade-off than ResNet.
The author uses depthwise convolution, a special form of grouped convolution in which the number of groups equals the number of channels. In a DwConv, every kernel has a channel dimension of 1 and is responsible for exactly one channel of the input feature map, so the number of kernels must equal the number of input channels, and the output feature map therefore has the same number of channels as the input.
Depthwise convolution is chosen because it is very similar to the weighted-sum operation in self-attention: it mixes information only in the spatial dimension, on a per-channel basis.
self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim) # depthwise conv
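To make the "one kernel per channel" point concrete, here is a small sketch comparing the parameter count of this depthwise 7x7 convolution with a standard 7x7 convolution at the same width (96 channels are assumed purely for illustration):

import torch.nn as nn

dim = 96
dw_conv   = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # one 7x7x1 kernel per channel
full_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=3)              # 96 kernels of size 7x7x96

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dw_conv), count(full_conv))  # 4800 vs 451680 (weights plus biases)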
2.4 width ↑
The author increases the initial number of channels from 64 to 96, in line with Swin; this raises the accuracy to 80.5%.
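A tiny sketch of the resulting per-stage widths, under the assumption that they double at each stage as in Swin-T (the Tiny variant):

base_width = 96  # initial channel count, up from ResNet's 64
dims = [base_width * 2 ** i for i in range(4)]
print(dims)  # [96, 192, 384, 768]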
2.5 inverting dims

In a residual network, the bottleneck block is narrow in the middle and wide at both ends. MobileNetV2 instead uses the inverted bottleneck structure (b), which reduces information loss.
The MLP layer in Swin has a similar structure (expand, then project back down), so the author also brings it into ConvNeXt.
To make room for larger convolution kernels (7x7), the author then moves the DwConv layer up to the top of the block, mirroring the Transformer, where the MSA module is placed before the MLP module.
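The channel flow can be sketched as below (96 channels assumed; this mirrors the full Block in 2.7 with normalization, Layer Scale and the residual connection omitted): the block expands by 4x in the middle instead of shrinking, and the 7x7 DwConv sits at the front so the heavy spatial mixing runs on the narrow width.

import torch.nn as nn

dim = 96
# Inverted bottleneck with the DwConv moved up: narrow -> (spatial mix) -> wide -> narrow.
inverted_block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise conv
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 1x1 expand (96 -> 384)
    nn.GELU(),
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 reduce (384 -> 96)
)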
2.6 Large Kernel
The current mainstream practice is to stack small convolution kernels instead of using a single large one (as in VGG), because small kernels have efficient hardware implementations on modern GPUs.
The author tried DwConv kernel sizes of 3, 5, 7, 9 and 11 and found that accuracy saturates at 7, which also matches Swin's window size.
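Because padding = k // 2 keeps the spatial resolution unchanged for odd kernel sizes, the kernel size can be swept in isolation, as in this small sketch:

import torch
import torch.nn as nn

dim = 96
x = torch.randn(1, dim, 56, 56)
for k in (3, 5, 7, 9, 11):
    dw = nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2, groups=dim)
    print(k, tuple(dw(x).shape))  # spatial size stays (56, 56) for every k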
2.7 Micro Design
(1) Replace the ReLU activation function with GELU;
(2) Use fewer activation functions;
(3) Use fewer normalization layers;
(4) Replace BN (Batch Normalization) with LN (Layer Normalization); the LayerNorm below is rewritten so that it also supports the channels_first format:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    r""" LayerNorm that supports two data formats: channels_last (default) or channels_first.
    The ordering of the dimensions in the inputs. channels_last corresponds to inputs with
    shape (batch_size, height, width, channels) while channels_first corresponds to inputs
    with shape (batch_size, channels, height, width).
    """
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError
        self.normalized_shape = (normalized_shape, )

    def forward(self, x):
        if self.data_format == "channels_last":
            # (N, H, W, C): normalize over the last dimension with the built-in op.
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            # (N, C, H, W): normalize over the channel dimension manually.
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x
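A short usage sketch of the two formats (shapes are illustrative):

# channels_first: used where the tensor is still (N, C, H, W), e.g. in the stem
# and in the downsampling layers.
ln_cf = LayerNorm(96, eps=1e-6, data_format="channels_first")
print(ln_cf(torch.randn(2, 96, 56, 56)).shape)   # torch.Size([2, 96, 56, 56])

# channels_last (default): used inside the Block after permuting to (N, H, W, C).
ln_cl = LayerNorm(96, eps=1e-6)
print(ln_cl(torch.randn(2, 56, 56, 96)).shape)   # torch.Size([2, 56, 56, 96])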
(5) Use a separate downsampling layer.
In ResNet, downsampling is normally done inside the first block of a stage: a k3s2 convolution on the main branch and a k1s2 convolution on the identity branch.
In Swin Transformer, the downsampling layer is separated from the other operations: a k2s2 (2x2 kernel, stride 2) convolution is inserted between stages.
ConvNeXt adopts the same strategy. Experiments show that adding a normalization layer wherever the spatial resolution changes helps stabilize training, so an LN is also added before each downsampling layer, after the stem, and after global average pooling.
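A minimal sketch of one such standalone downsampling layer (the widths are assumed to be the Tiny variant's, and the LayerNorm is the channels_first version defined in 2.7):

import torch.nn as nn

dims = [96, 192, 384, 768]
i = 0  # e.g. between stage 1 and stage 2
downsample = nn.Sequential(
    LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),  # LN before the resolution change
    nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2),    # k2s2: halves H and W
)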
The comparison is as follows:

Block implementation code:
from timm.models.layers import DropPath  # stochastic depth (torch and nn are imported with the LayerNorm code above)

class Block(nn.Module):
    r""" ConvNeXt Block. There are two equivalent implementations:
    (1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W)
    (2) DwConv -> Permute to (N, H, W, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear;
        Permute back
    We use (2) as we find it slightly faster in PyTorch

    Args:
        dim (int): Number of input channels.
        drop_path (float): Stochastic depth rate. Default: 0.0
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = LayerNorm(dim, eps=1e-6)  # the LayerNorm defined above, channels_last format
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise/1x1 convs, implemented with linear layers
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim)),
                                  requires_grad=True) if layer_scale_init_value > 0 else None
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        input = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)  # expand: dim -> 4 * dim
        x = self.act(x)
        x = self.pwconv2(x)  # project back: 4 * dim -> dim
        if self.gamma is not None:
            x = self.gamma * x  # Layer Scale
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        x = input + self.drop_path(x)  # residual connection with stochastic depth
        return x
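A quick usage sketch: a Block preserves both the channel count and the spatial size, so it can be stacked freely inside a stage.

import torch

blk = Block(dim=96, drop_path=0.1)
x = torch.randn(2, 96, 56, 56)
print(blk(x).shape)  # torch.Size([2, 96, 56, 56])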
3. Summary
Besides the techniques above, the author uses several other tricks; the details can be found in the code of the official repository.
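To tie the pieces together, here is a minimal sketch (not the official implementation; the class name ConvNeXtSketch is only for illustration, and the stochastic-depth schedule, weight initialization and other training details are omitted) of how the patchify stem, the standalone downsampling layers, the stacked Blocks and the final global-average-pool + LN + linear head could be assembled for the assumed Tiny configuration:

import torch
import torch.nn as nn

class ConvNeXtSketch(nn.Module):
    def __init__(self, in_chans=3, num_classes=1000,
                 depths=(3, 3, 9, 3), dims=(96, 192, 384, 768)):
        super().__init__()
        # Stem plus three standalone downsampling layers (LN before each stride-2 conv).
        self.downsample_layers = nn.ModuleList()
        self.downsample_layers.append(nn.Sequential(
            nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
            LayerNorm(dims[0], eps=1e-6, data_format="channels_first"),
        ))
        for i in range(3):
            self.downsample_layers.append(nn.Sequential(
                LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2),
            ))
        # Four stages, each a stack of Blocks (drop_path schedule omitted in this sketch).
        self.stages = nn.ModuleList([
            nn.Sequential(*[Block(dim=dims[i]) for _ in range(depths[i])])
            for i in range(4)
        ])
        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)  # final LN after global average pooling
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):
        for i in range(4):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)
        x = x.mean([-2, -1])  # global average pooling over H, W -> (N, C)
        x = self.norm(x)
        return self.head(x)

x = torch.randn(1, 3, 224, 224)
print(ConvNeXtSketch()(x).shape)  # torch.Size([1, 1000])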
Below is the complete structure comparison diagram, with LN and downsampling layers omitted:
