
YOLOv5-6.0 Series | YOLOv5 Module Design

2022-06-09 05:29:00 Clichong


If there is a mistake, please point it out.



This note records some of the network modules in yolov5, mainly from the common.py and experimental.py files. Several lightweight network modules are covered, and the principles behind them are introduced along the way. It also covers some processing tricks designed for other tasks, including the slicing operation (Focus), the SPPF design, and weighted feature fusion. In short, the parts of common.py and experimental.py that I find ingenious and attractive are summarized here.


1. Slicing (Focus)

(Figure: Focus slicing schematic)

Design idea:

For a very-high-resolution image, pixels can in principle be sampled periodically and reassembled into lower-resolution images. We could also downsample directly with interpolation or other mathematical methods, but that would undoubtedly lose some image information. The Focus module instead reconstructs a high-resolution image into 4 low-resolution images by periodically picking pixels: the four neighboring positions of the image are stacked, concentrating w/h spatial information into the channel dimension. This enlarges the receptive field of each point and reduces the loss of original information. The goal is to cut computation and speed up inference, not to increase the network's accuracy.
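Note: the snippets in this post rely on yolov5's autopad and Conv helpers (and DWConv further below). A minimal sketch of them, matching the YOLOv5-6.0 layout (Conv2d + BatchNorm2d + SiLU with automatic 'same' padding), is reproduced here so the code blocks below can run:

import math
import torch
import torch.nn as nn


def autopad(k, p=None):
    # pad to keep the spatial size when no padding is given
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p


class Conv(nn.Module):
    # standard yolov5 convolution block: Conv2d + BatchNorm2d + SiLU
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class DWConv(Conv):
    # depth-wise convolution: groups = gcd(c1, c2)
    def __init__(self, c1, c2, k=1, s=1, act=True):
        super().__init__(c1, c2, k, s, g=math.gcd(c1, c2), act=act)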

Code implementation:

class Focus(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super(Focus, self).__init__()
        # convolution applied after the concat (the final conv)
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):
        # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        image = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                           x[..., ::2, 1::2], x[..., 1::2, 1::2]],
                          dim=1)
        return self.conv(image)
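A quick shape check of the slicing (assuming the helper sketch above is in scope):

focus = Focus(3, 32, k=3)
x = torch.rand(1, 3, 640, 640)
print(focus(x).shape)  # torch.Size([1, 32, 320, 320]): wh halved, information moved into channels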

2. Bottleneck

This is the essence of the yolo-series backbone; yolov5 provides several variants:
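BottleneckCSP and C3 below both stack the plain Bottleneck block. For reference, here is a minimal sketch of it matching the yolov5 source (the Conv sketch above is assumed):

class Bottleneck(nn.Module):
    # standard bottleneck: 1x1 conv -> 3x3 conv, with optional residual connection
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2  # residual add only when shapes match

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))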

2.1 BottleneckCSP

  • The first variant: BottleneckCSP

(Figure: BottleneckCSP structure)

yolov5 code:

class BottleneckCSP(nn.Module):
    # CSP Bottleneck https://github.com/WongKinYiu/CrossStagePartialNetworks
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = nn.Conv2d(c1, c_, 1, 1, bias=False)
        self.cv3 = nn.Conv2d(c_, c_, 1, 1, bias=False)
        self.cv4 = Conv(2 * c_, c2, 1, 1)
        self.bn = nn.BatchNorm2d(2 * c_)  # applied to cat(cv2, cv3)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        y1 = self.cv3(self.m(self.cv1(x)))
        y2 = self.cv2(x)
        return self.cv4(self.act(self.bn(torch.cat((y1, y2), dim=1))))

2.2 C3

  • The second variant: C3. It contains only 3 Conv modules, hence the name. Compared with BottleneckCSP, C3 is simpler, faster, and lighter (a quick parameter count follows the code below).

(Figure: C3 structure)

yolov5 code:

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
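A rough way to back up the "lighter" claim: count parameters of the two variants at the same width (using the Conv and Bottleneck sketches above; exact numbers depend on those definitions):

csp = BottleneckCSP(128, 128, n=1)
c3 = C3(128, 128, n=1)
print(sum(p.numel() for p in csp.parameters()))  # larger: extra cv3 plus a standalone BatchNorm
print(sum(p.numel() for p in c3.parameters()))   # smaller: only 3 Conv modules around the Bottleneck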

2.3 GhostBottleneck

  • The third variant: GhostBottleneck

Paper: GhostNet: More Features from Cheap Operations

yolov5 also implements GhostConv here, a lightweight network structure. The main idea is that part of the convolution is unnecessary and can be replaced by some cheap linear operations. The GhostNet authors visualized the feature maps of the first residual group of ResNet-50, found three similar pairs among them, and concluded that there is redundancy (correlation) between feature-map pairs.

(Figure: redundant feature-map pairs visualized in ResNet-50, from the GhostNet paper)

This redundant information in the feature maps may be an important part of a successful model: it is precisely this redundancy that lets the network fully understand the input. So when designing the lightweight model, the authors did not try to remove the redundant feature maps; instead they try to obtain them with lower-cost computation, i.e. to generate these similar, redundant feature maps through cheap linear transformations.

Concretely, an ordinary convolution layer in a deep neural network is split into two parts. The first part is a normal convolution, but its output channel count is strictly controlled. Given the feature maps of the first part, a series of simple linear operations is then applied to generate more feature maps.

(Figure: GhostConv schematic)

yolov5 code:

class GhostConv(nn.Module):
    # Ghost Convolution https://github.com/huawei-noah/ghostnet
    def __init__(self, c1, c2, k=1, s=1, g=1, act=True):  # ch_in, ch_out, kernel, stride, groups
        super().__init__()
        c_ = c2 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, k, s, None, g, act)   # primary conv, producing half the channels
        self.cv2 = Conv(c_, c_, 5, 1, None, c_, act)  # cheap operation: 5x5 depth-wise conv (groups=c_)

    def forward(self, x):
        y = self.cv1(x)
        return torch.cat([y, self.cv2(y)], 1)  # primary feature maps + their ghosts
(Figure: GhostBottleneck structure)

yolov5 code:

class GhostBottleneck(nn.Module):
    # Ghost Bottleneck https://github.com/huawei-noah/ghostnet
    def __init__(self, c1, c2, k=3, s=1):  # ch_in, ch_out, kernel, stride
        super().__init__()
        c_ = c2 // 2
        self.conv = nn.Sequential(GhostConv(c1, c_, 1, 1),  # pw
                                  DWConv(c_, c_, k, s, act=False) if s == 2 else nn.Identity(),  # dw
                                  GhostConv(c_, c2, 1, 1, act=False))  # pw-linear
        self.shortcut = nn.Sequential(DWConv(c1, c1, k, s, act=False),
                                      Conv(c1, c2, 1, 1, act=False)) if s == 2 else nn.Identity()

    def forward(self, x):
        return self.conv(x) + self.shortcut(x)

Brief analysis: notice that the code differs from the original paper. When s == 2, the shortcut here is a separate downsampling branch (DWConv + 1x1 Conv) rather than a plain residual addition.
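A shape check covering both paths (assuming the Conv, DWConv and GhostConv sketches above are in scope):

x = torch.rand(1, 64, 80, 80)
gb1 = GhostBottleneck(64, 64, k=3, s=1)   # s=1: shortcut is nn.Identity
gb2 = GhostBottleneck(64, 128, k=3, s=2)  # s=2: shortcut is the DWConv + 1x1 Conv branch
print(gb1(x).shape)  # torch.Size([1, 64, 80, 80])
print(gb2(x).shape)  # torch.Size([1, 128, 40, 40]): spatial size halved on both branches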

2.4 TransformerBlock

  • The fourth variant: TransformerBlock

For a detailed introduction to the Transformer, see these two earlier notes:

1. Learning notes — A complete introduction to the Transformer structure
2. Vision Transformer: notes summary and PyTorch implementation

(Figure: TransformerBlock structural sketch)

The yolov5 implementation also makes some changes: it removes the LayerNorm operations, and the MLP block uses no activation function, just two fully connected layers applied back to back. Moreover, q, k and v are all produced by plain Linear layers. The whole architecture is very concise; the code is as follows:

yolov5 code:

class TransformerLayer(nn.Module):
    # Transformer layer https://arxiv.org/abs/2010.11929 (LayerNorm layers removed for better performance)
    def __init__(self, c, num_heads):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        # compute q, k, v, then apply multi-head self-attention
        x = self.ma(self.q(x), self.k(x), self.v(x))[0] + x
        x = self.fc2(self.fc1(x)) + x
        return x


class TransformerBlock(nn.Module):
    # Vision Transformer https://arxiv.org/abs/2010.11929
    def __init__(self, c1, c2, num_heads, num_layers):
        super().__init__()
        self.conv = None
        if c1 != c2:
            self.conv = Conv(c1, c2)
        self.linear = nn.Linear(c2, c2)  # learnable position embedding
        self.tr = nn.Sequential(*[TransformerLayer(c2, num_heads) for _ in range(num_layers)])
        self.c2 = c2

    def forward(self, x):
        if self.conv is not None:
            x = self.conv(x)
        b, _, w, h = x.shape
        # (b,c,h,w) -> (b,c,hw) -> (1,b,c,hw) -> (hw,b,c,1) -> (hw,b,c)
        p = x.flatten(2).unsqueeze(0).transpose(0, 3).squeeze(3)

        # TransformerLayer keeps the shape of the feature sequence unchanged
        # (hw,b,c) -> (hw,b,c) -> (hw,b,c,1) -> (1,b,c,hw) -> (b,c,h,w)
        return self.tr(p + self.linear(p)).unsqueeze(3).transpose(0, 3).reshape(b, self.c2, w, h)
        
class C3TR(C3):
    # C3 module with TransformerBlock()
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        c_ = int(c2 * e)
        # simply replaces the Bottleneck stack with a TransformerBlock
        self.m = TransformerBlock(c_, c_, 4, n)
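A minimal shape check (with c1 == c2 no extra Conv is created, so this runs with torch alone; nn.MultiheadAttention receives the (sequence, batch, channel) layout produced by the reshapes):

tb = TransformerBlock(64, 64, 4, 1)  # c2=64, 4 heads, 1 TransformerLayer
x = torch.rand(2, 64, 20, 20)
print(tb(x).shape)  # torch.Size([2, 64, 20, 20]): the spatial layout is restored afterwards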

3. Neck

The Neck fuses features at multiple resolutions to obtain richer information. Although it has been introduced before, the yolov5 code is well written, so it is reproduced here.

3.1 SPP

  • SPP: parallel pooling branches

(Figure: SPP structure)

yolov5 code:

class SPP(nn.Module):
    # Spatial Pyramid Pooling (SPP) layer https://arxiv.org/abs/1406.4729
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))

3.2 SPPF

  • SPPF: serial pooling (the same output as SPP, but faster)

(Figure: SPPF structure)

yolov5 code:

class SPPF(nn.Module):
    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            y1 = self.m(x)
            y2 = self.m(y1)
            return self.cv2(torch.cat([x, y1, y2, self.m(y2)], 1))
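The "equivalent to SPP(k=(5, 9, 13))" comment is easy to verify: stride-1 max pools compose, so two 5x5 pools give an effective 9x9 window and three give 13x13. A minimal check (relying on MaxPool2d's implicit -inf padding):

import torch
import torch.nn as nn

x = torch.rand(1, 8, 32, 32)
m5 = nn.MaxPool2d(5, stride=1, padding=2)
m9 = nn.MaxPool2d(9, stride=1, padding=4)
m13 = nn.MaxPool2d(13, stride=1, padding=6)
y1 = m5(x)
y2 = m5(y1)
print(torch.equal(m9(x), y2))       # True: 5x5 applied twice == 9x9
print(torch.equal(m13(x), m5(y2)))  # True: 5x5 applied three times == 13x13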

4. Dimension reshaping (Contract / Expand)

yolov5 provides two modules that change dimensions: Contract and Expand. Contract changes the input shape by shrinking the w and h dimensions of the feature map and moving that data into the (enlarged) channel dimension, e.g. x(1,64,80,80) to x(1,256,40,40). Expand also changes the input shape, but does the opposite of Contract: it spreads data from the (shrunken) channel dimension out into the (enlarged) w and h dimensions, e.g. x(1,64,80,80) to x(1,16,160,160).

The point to note is that this reshaping differs from an ordinary view/reshape; see the yolov5 code below:

class Contract(nn.Module):
    # Contract width-height into channels, i.e. x(1,64,80,80) to x(1,256,40,40)
    def __init__(self, gain=2):
        super().__init__()
        self.gain = gain

    def forward(self, x):
        b, c, h, w = x.size()  # assert h % s == 0 and w % s == 0, 'Indivisible gain'
        s = self.gain
        x = x.view(b, c, h // s, s, w // s, s)  # x(1,64,40,2,40,2)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # x(1,2,2,64,40,40)
        return x.view(b, c * s * s, h // s, w // s)  # x(1,256,40,40)


class Expand(nn.Module):
    # Expand channels into width-height, i.e. x(1,64,80,80) to x(1,16,160,160)
    def __init__(self, gain=2):
        super().__init__()
        self.gain = gain

    def forward(self, x):
        b, c, h, w = x.size()  # assert c % s ** 2 == 0, 'Indivisible gain'
        s = self.gain
        x = x.view(b, s, s, c // s ** 2, h, w)  # x(1,2,2,16,80,80)
        x = x.permute(0, 3, 4, 1, 5, 2).contiguous()  # x(1,16,80,2,80,2)
        return x.view(b, c // s ** 2, h * s, w * s)  # x(1,16,160,160)

Brief analysis: in both Contract and Expand the tensor is first viewed with w and h split into factors, then permuted to move data between the spatial and channel dimensions, and finally the memory is serialized with contiguous() before the last view. This avoids scrambling the data, which a plain reshape would cause; the round-trip check below confirms it.
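Expand with the same gain exactly inverts Contract, which confirms the two permute orders are consistent; a minimal round-trip check (assuming the two classes above):

x = torch.rand(1, 64, 80, 80)
y = Contract(gain=2)(x)  # torch.Size([1, 256, 40, 40])
z = Expand(gain=2)(y)    # torch.Size([1, 64, 80, 80])
print(torch.equal(z, x))  # True: no data is scrambled on the way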


5. Second-stage classification

At the end of common.py, yolov5 provides a second-stage classification module. First, what is second-stage classification?

What is a second-stage classification module? Take license plate recognition: the detector first finds the plate; recognizing the characters on it then requires a further, second-stage classification. Whenever the output of the model needs to be classified again, this module can be used. The class here is fairly simple, though; for a complex second stage you can rewrite it for your actual task, so this code is not the only option. The function here is similar to load_classifier in torch_utils.py.

The essential idea: (b,c1,w,h) -> (b,c2)

Reference code:

class Classify(nn.Module):
    # Classification head, i.e. x(b,c1,20,20) to x(b,c2)
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.aap = nn.AdaptiveAvgPool2d(1)  # to x(b,c1,1,1)
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g)  # to x(b,c2,1,1)
        self.flat = nn.Flatten()

    def forward(self, x):
        z = torch.cat([self.aap(y) for y in (x if isinstance(x, list) else [x])], 1)  # cat if list
        return self.flat(self.conv(z))  # flatten to x(b,c2)

Analysis: the implementation is simply global average pooling, a convolution, then flattening.
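A usage sketch (assuming the autopad helper from the Conv sketch above; the 512-channel / 10-class sizes are just made-up examples):

head = Classify(512, 10)  # e.g. a 512-channel feature map in, 10 class scores out
x = torch.rand(2, 512, 20, 20)
print(head(x).shape)  # torch.Size([2, 10])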


6. Other convolution modules

The structural part of the yolov5 code also provides some convolution modules I had not seen before; they are recorded here as well.

6.1 MixConv2d

Paper: MixConv: Mixed Depthwise Convolutional Kernels (https://arxiv.org/abs/1907.09595)
Idea: partition the channels, convolve each partition with a different kernel size, then concatenate.

(Figure: MixConv channel partitioning, from the paper)

The starting observation: as the convolution kernel grows, feature-extraction ability increases, but past a certain size it degrades. The paper makes the following 3 findings:

  1. Large-kernel convolutions are good at extracting large-receptive-field features; small kernels are good at small-receptive-field features
  2. Large- and small-receptive-field features are complementary; a larger receptive field is not necessarily better (consider the extreme case where the kernel is the same size as the feature map)
  3. Mixing features from different receptive fields helps improve feature extraction

Mixing features from different receptive fields is, however, something Inception did long ago, at a somewhat high computational cost. MixConv therefore proposes a slightly lighter receptive-field fusion, partly borrowing the idea of depthwise separable convolution: partition the channels, convolve each partition with a different kernel size, then concatenate.

yolov5 code:

class MixConv2d(nn.Module):
    # Mixed Depth-wise Conv https://arxiv.org/abs/1907.09595
    def __init__(self, c1, c2, k=(1, 3), s=1, equal_ch=False):
        super().__init__()
        groups = len(k)
        if equal_ch:  # split output channels evenly across kernel sizes
            i = torch.linspace(0, groups - 1E-6, c2).floor()  # c2 indices (floor drops the fraction)
            c_ = [(i == g).sum() for g in range(groups)]  # intermediate channels

        else:  # split so that each group gets roughly the same number of weights
            b = [c2] + [0] * groups
            a = np.eye(groups + 1, groups, k=-1)
            a -= np.roll(a, 1, axis=1)
            a *= np.array(k) ** 2
            a[0] = 1
            c_ = np.linalg.lstsq(a, b, rcond=None)[0].round()  # solve for equal weight indices, ax = b

        self.m = nn.ModuleList([nn.Conv2d(c1, int(c_[g]), k[g], s, k[g] // 2, bias=False) for g in range(groups)])
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        # the idea is to convolve different channels with different kernel sizes,
        # but here every branch sees all input channels; each branch produces its own
        # group of output channels, and the groups are concatenated (no input channel split)
        return x + self.act(self.bn(torch.cat([m(x) for m in self.m], 1)))

Analysis: in practice, yolov5 modifies the original MixConv2d; inspecting its core self.m makes this clear:

ModuleList(
  (0): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (1): Conv2d(32, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)

The original MixConv partitions the input channels and applies a different kernel to each partition; yolov5 instead convolves all input channels with each kernel size, producing separate groups of output channels that are then concatenated. A small demo of the channel-split logic follows.
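To see what the two strategies produce, here is a small demo reproducing only the splitting logic from __init__, for c2 = 32 and k = (1, 3):

import numpy as np
import torch

c2, k = 32, (1, 3)
groups = len(k)

# equal_ch=True: output channels split evenly across kernel sizes
i = torch.linspace(0, groups - 1E-6, c2).floor()
print([(i == g).sum().item() for g in range(groups)])  # [16, 16]

# equal_ch=False: solve ax = b so every branch gets roughly the same number of weights
b = [c2] + [0] * groups
a = np.eye(groups + 1, groups, k=-1)
a -= np.roll(a, 1, axis=1)
a *= np.array(k) ** 2
a[0] = 1
print(np.linalg.lstsq(a, b, rcond=None)[0].round())  # [29., 3.]: the cheap 1x1 branch gets most channels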

6.2 CrossConv

For this module I have not actually found any write-ups online (if you find one, let me know in the comments).

The name alone gives away the basic idea: cross convolution. An ordinary 3x3 convolution can be seen as a local sliding window, and "cross" here means the convolution is performed in a cross shape. Concretely, it is implemented as a (1, k) convolution followed by a (k, 1) convolution.

It reminded me of a Transformer article (note: Paper reading notes | Transformer series — CSWin Transformer). That paper proposes cross-shaped windows, computing self-attention in parallel over the vertical and horizontal stripes of a cross-shaped window, where each stripe is an equal-width split of the input feature. This enlarges the receptive field of a single layer while reducing the computation.

Coming back to the yolov5 code, I personally read it as borrowing an idea similar to CSWin Transformer:

class CrossConv(nn.Module):
    # Cross Convolution Downsample
    def __init__(self, c1, c2, k=3, s=1, g=1, e=1.0, shortcut=False):
        # ch_in, ch_out, kernel, stride, groups, expansion, shortcut
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, (1, k), (1, s))
        self.cv2 = Conv(c_, c2, (k, 1), (s, 1), g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

My personal understanding: in video understanding, 3D convolutions have huge parameter counts, so one usually convolves over space first and then over time. In essence, this kind of factorization reduces the number of parameters and floating-point operations without reducing accuracy.
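The factorization saves parameters compared with a full k x k convolution, in the same spirit as that spatio-temporal split; a rough count (using the Conv sketch above):

cross = CrossConv(64, 64, k=3, s=1)
full = Conv(64, 64, 3, 1)
print(sum(p.numel() for p in cross.parameters()))  # (1,3) + (3,1) kernels: about 2/3 of the weights
print(sum(p.numel() for p in full.parameters()))   # one full 3x3 kernel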


7. Weighted feature fusion

Idea: traditional feature fusion often just sums or stacks feature maps (sum them up), e.g. with concat or shortcut connections, without distinguishing between the feature maps being added. However, input feature maps at different resolutions contribute unequally to the fused output, so simply adding or stacking them is not optimal. Hence this simple and efficient weighted feature fusion mechanism (proposed in the EfficientDet paper linked in the code).

yolov5 code:

class Sum(nn.Module):
    # Weighted sum of 2 or more layers https://arxiv.org/abs/1911.09070
    def __init__(self, n, weight=False):  # n: number of inputs
        super().__init__()
        self.weight = weight  # apply weights boolean
        self.iter = range(n - 1)  # iter object

        # build n-1 learnable weights for the n input layers
        if weight:
            self.w = nn.Parameter(-torch.arange(1., n) / 2, requires_grad=True)  # layer weights

    def forward(self, x):
        y = x[0]  # no weight
        if self.weight:
            w = torch.sigmoid(self.w) * 2
            # weighted sum of the inputs
            # y = x[0] + x[1]*w0 + x[2]*w1 + x[3]*w2
            for i in self.iter:
                y = y + x[i + 1] * w[i]
        else:
            # plain feature fusion, no per-input weighting
            # y = x[0] + x[1] + x[2] + x[3]
            for i in self.iter:
                y = y + x[i + 1]
        return y
        
# test code
if __name__ == '__main__':
    x = torch.rand([2, 32, 8, 8])
    p = [x, x, x, x]
    bottleneck = Sum(n=4, weight=True)
    print(bottleneck(p).shape)

8. Ensemble (model ensembling)

See: YOLOv5 Tricks | [Trick2] Multi-model inference and prediction in object detection (Model Ensemble)


References:

1. GhostNet_GhostModule (2020)

2. GhostNet

3. Learning notes — A complete introduction to the Transformer structure

4. Vision Transformer: notes summary and PyTorch implementation

5. MixConv: Mixed Depthwise Convolutional Kernels

6. [YOLOv5-5.x source code walkthrough] common.py

7. [YOLOv5-5.x source code walkthrough] experimental.py

8. YOLOv5 Tricks | [Trick2] Multi-model inference and prediction in object detection (Model Ensemble)
