ConvNeXt: A ConvNet for the 2020s - Model Brief
2022-07-27 22:56:00 【gongyuandaye】
1. Abstract

The paper benchmarks itself against last year's best paper, Swin Transformer: at the same FLOPs it achieves higher accuracy and faster inference. It borrows Swin's design patterns and training tricks (such as the AdamW optimizer) across the board, and folds Swin's strategies step by step into the ResNet design. The figure below shows clearly how each incremental change affects accuracy:
2. Model design
The design ideas behind each step of the roadmap above are briefly explained below.
2.1 stage ratio
VGG introduced the idea of dividing the backbone network into several stages, each of which downsamples the feature map to a different size; a model tends to perform better when its deeper stages contain more blocks. ResNet-50 has 4 stages with block counts of (3, 4, 6, 3), while Swin-T uses (2, 2, 6, 2), a ratio of 1:1:3:1. The author applies this ratio to ResNet, so the number of blocks stacked in each stage becomes (3, 3, 9, 3).
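For reference, a tiny sketch of the block counts quoted above (variable names are only for illustration); it just makes the shared 1:1:3:1 ratio explicit:

resnet50_depths   = (3, 4, 6, 3)
swin_t_depths     = (2, 2, 6, 2)
convnext_t_depths = (3, 3, 9, 3)

# Swin-T and ConvNeXt-T share the same stage ratio:
print([d // min(convnext_t_depths) for d in convnext_t_depths])  # [1, 1, 3, 1]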
2.2 patchify stem
In Swin, the 224x224 input image is first downsampled 4x with a k4s4 (4x4 kernel, stride 4) convolution. The author likewise replaces the stem of ResNet with this patchify layer:
stem = nn.Sequential(
    nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
    LayerNorm(dims[0], eps=1e-6, data_format="channels_first")
)
Here dims[0] = 96, the same output dimension as Swin (see 2.4).
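As a quick sanity check, a minimal sketch (assuming a standard 3x224x224 input and dims[0] = 96; the LayerNorm part is left out here because it is only defined in 2.7) shows the stem reducing the spatial size by 4x in a single step:

import torch
import torch.nn as nn

in_chans, dims = 3, [96, 192, 384, 768]  # assumed Tiny-variant widths
stem_conv = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)

x = torch.randn(1, in_chans, 224, 224)
print(stem_conv(x).shape)  # torch.Size([1, 96, 56, 56]), i.e. 4x spatial downsampling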
2.3 Depthwise convolution (ResNeXt-ify)
ResNeXt uses grouped convolution: the channels are divided into groups and each group is convolved separately, which gives a better FLOPs/accuracy trade-off than ResNet.
The author uses depthwise convolution, a special form of grouped convolution in which the number of groups equals the number of channels. In a DwConv, every kernel has a channel dimension of 1 and is responsible for exactly one channel of the input feature map, so the number of kernels must equal the number of input channels, and the output feature map therefore has the same number of channels as the input.
Depthwise convolution is chosen because it is very similar to the weighted-sum operation in self-attention: it mixes information only in the spatial dimension, on a per-channel basis.
self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim) # depthwise conv
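To make the "one kernel per channel" point concrete, here is a small sketch comparing the parameter count of this depthwise 7x7 convolution with a standard 7x7 convolution at the same width (96 channels are assumed purely for illustration):

import torch.nn as nn

dim = 96
dw_conv   = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # one 7x7x1 kernel per channel
full_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=3)              # 96 kernels of size 7x7x96

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dw_conv), count(full_conv))  # 4800 vs 451680 (weights plus biases)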
2.4 width ↑
The author increases the initial number of channels from 64 to 96, in line with Swin; this raises the accuracy to 80.5%.
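A tiny sketch of the resulting per-stage widths, under the assumption that they double at each stage as in Swin-T (the Tiny variant):

base_width = 96  # initial channel count, up from ResNet's 64
dims = [base_width * 2 ** i for i in range(4)]
print(dims)  # [96, 192, 384, 768]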
2.5 inverting dims

In a residual network, the bottleneck block is narrow in the middle and wide at both ends. MobileNetV2 instead uses the inverted bottleneck structure (b), which reduces information loss.
The MLP layer in Swin has a similar structure (expand, then project back down), so the author also brings it into ConvNeXt.
To make room for larger convolution kernels (7x7), the author then moves the DwConv layer up to the top of the block, mirroring the Transformer, where the MSA module is placed before the MLP module.
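The channel flow can be sketched as below (96 channels assumed; this mirrors the full Block in 2.7 with normalization, Layer Scale and the residual connection omitted): the block expands by 4x in the middle instead of shrinking, and the 7x7 DwConv sits at the front so the heavy spatial mixing runs on the narrow width.

import torch.nn as nn

dim = 96
# Inverted bottleneck with the DwConv moved up: narrow -> (spatial mix) -> wide -> narrow.
inverted_block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise conv
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 1x1 expand (96 -> 384)
    nn.GELU(),
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 reduce (384 -> 96)
)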
2.6 Large Kernel
The current mainstream practice is to stack small convolution kernels instead of using a single large one (as in VGG), because small kernels have efficient hardware implementations on modern GPUs.
The author tried DwConv kernel sizes of 3, 5, 7, 9 and 11 and found that accuracy saturates at 7, which also matches Swin's window size.
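Because padding = k // 2 keeps the spatial resolution unchanged for odd kernel sizes, the kernel size can be swept in isolation, as in this small sketch:

import torch
import torch.nn as nn

dim = 96
x = torch.randn(1, dim, 56, 56)
for k in (3, 5, 7, 9, 11):
    dw = nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2, groups=dim)
    print(k, tuple(dw(x).shape))  # spatial size stays (56, 56) for every k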
2.7 Micro Design
(1) Replace the ReLU activation function with GELU;
(2) Use fewer activation functions;
(3) Use fewer normalization layers;
(4) Replace BN (Batch Normalization) with LN (Layer Normalization); the LayerNorm below is rewritten so that it also supports the channels_first format:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    r""" LayerNorm that supports two data formats: channels_last (default) or channels_first.
    The ordering of the dimensions in the inputs. channels_last corresponds to inputs with
    shape (batch_size, height, width, channels) while channels_first corresponds to inputs
    with shape (batch_size, channels, height, width).
    """
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError
        self.normalized_shape = (normalized_shape, )

    def forward(self, x):
        if self.data_format == "channels_last":
            # (N, H, W, C): normalize over the last dimension with the built-in op.
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            # (N, C, H, W): normalize over the channel dimension manually.
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x
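A short usage sketch of the two formats (shapes are illustrative):

# channels_first: used where the tensor is still (N, C, H, W), e.g. in the stem
# and in the downsampling layers.
ln_cf = LayerNorm(96, eps=1e-6, data_format="channels_first")
print(ln_cf(torch.randn(2, 96, 56, 56)).shape)   # torch.Size([2, 96, 56, 56])

# channels_last (default): used inside the Block after permuting to (N, H, W, C).
ln_cl = LayerNorm(96, eps=1e-6)
print(ln_cl(torch.randn(2, 56, 56, 96)).shape)   # torch.Size([2, 56, 56, 96])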
(5) Use a separate downsampling layer.
In ResNet, downsampling is normally done inside the first block of a stage: a k3s2 convolution on the main branch and a k1s2 convolution on the identity branch.
In Swin Transformer, the downsampling layer is separated from the other operations: a k2s2 (2x2 kernel, stride 2) convolution is inserted between stages.
ConvNeXt adopts the same strategy. Experiments show that adding a normalization layer wherever the spatial resolution changes helps stabilize training, so an LN is also added before each downsampling layer, after the stem, and after global average pooling.
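A minimal sketch of one such standalone downsampling layer (the widths are assumed to be the Tiny variant's, and the LayerNorm is the channels_first version defined in 2.7):

import torch.nn as nn

dims = [96, 192, 384, 768]
i = 0  # e.g. between stage 1 and stage 2
downsample = nn.Sequential(
    LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),  # LN before the resolution change
    nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2),    # k2s2: halves H and W
)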
The comparison is as follows:

Block implementation code:
from timm.models.layers import DropPath  # stochastic depth (torch and nn are imported with the LayerNorm code above)

class Block(nn.Module):
    r""" ConvNeXt Block. There are two equivalent implementations:
    (1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W)
    (2) DwConv -> Permute to (N, H, W, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear;
        Permute back
    We use (2) as we find it slightly faster in PyTorch

    Args:
        dim (int): Number of input channels.
        drop_path (float): Stochastic depth rate. Default: 0.0
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = LayerNorm(dim, eps=1e-6)  # the LayerNorm defined above, channels_last format
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise/1x1 convs, implemented with linear layers
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim)),
                                  requires_grad=True) if layer_scale_init_value > 0 else None
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        input = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)  # expand: dim -> 4 * dim
        x = self.act(x)
        x = self.pwconv2(x)  # project back: 4 * dim -> dim
        if self.gamma is not None:
            x = self.gamma * x  # Layer Scale
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        x = input + self.drop_path(x)  # residual connection with stochastic depth
        return x
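A quick usage sketch: a Block preserves both the channel count and the spatial size, so it can be stacked freely inside a stage.

import torch

blk = Block(dim=96, drop_path=0.1)
x = torch.randn(2, 96, 56, 56)
print(blk(x).shape)  # torch.Size([2, 96, 56, 56])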
3. Summary
Besides the techniques above, the author uses several other tricks; the details can be found in the code of the official repository.
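To tie the pieces together, here is a minimal sketch (not the official implementation; the class name ConvNeXtSketch is only for illustration, and the stochastic-depth schedule, weight initialization and other training details are omitted) of how the patchify stem, the standalone downsampling layers, the stacked Blocks and the final global-average-pool + LN + linear head could be assembled for the assumed Tiny configuration:

import torch
import torch.nn as nn

class ConvNeXtSketch(nn.Module):
    def __init__(self, in_chans=3, num_classes=1000,
                 depths=(3, 3, 9, 3), dims=(96, 192, 384, 768)):
        super().__init__()
        # Stem plus three standalone downsampling layers (LN before each stride-2 conv).
        self.downsample_layers = nn.ModuleList()
        self.downsample_layers.append(nn.Sequential(
            nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
            LayerNorm(dims[0], eps=1e-6, data_format="channels_first"),
        ))
        for i in range(3):
            self.downsample_layers.append(nn.Sequential(
                LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2),
            ))
        # Four stages, each a stack of Blocks (drop_path schedule omitted in this sketch).
        self.stages = nn.ModuleList([
            nn.Sequential(*[Block(dim=dims[i]) for _ in range(depths[i])])
            for i in range(4)
        ])
        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)  # final LN after global average pooling
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):
        for i in range(4):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)
        x = x.mean([-2, -1])  # global average pooling over H, W -> (N, C)
        x = self.norm(x)
        return self.head(x)

x = torch.randn(1, 3, 224, 224)
print(ConvNeXtSketch()(x).shape)  # torch.Size([1, 1000])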
Below is the complete structure comparison diagram, with LN and downsampling layers omitted:
