当前位置：网站首页>Yolov5 Lite: yolov5, which is lighter, faster and easy to deploy

Yolov5 Lite: yolov5, which is lighter, faster and easy to deploy

2022-07-08 02:19:00 【pogg_】

QQ Communication group ：993965802

The copyright of this article belongs to GiantPandaCV, Please do not reprint without permission

Preface ： Part of Bi Shi , Some time ago , stay yolov5 A series of ablation experiments were carried out on , Make him lighter （Flops smaller , Lower memory usage , Fewer parameters ）, faster （ Join in shuffle channel,yolov5 head Perform channel clipping , stay 320 Of input_size At least in raspberry pie 4B Reasoning last second 10 frame ）, Easier to deploy （ Enucleation Focus Layer and four slice operation , Make the quantization accuracy of the model drop within an acceptable range ）.

2021-08-26 to update ---------------------------------------

We try to improve the resolution and then lower it for training （train in 640-1280-640, The first round 640 Training 150 individual epochs, The second round 1280 Training 50 individual epochs, The third round 640 Training 100 individual epochs）, It is found that the model still has a certain learning ability ,map Still improving

# evaluate in 640×640：
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.271
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.457
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.274

# evaluate in 416×416：
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.244
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.413
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.246

# evaluate in 320×320：
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.208
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.206

One 、 Comparison of ablation experimental results

ID	Model	Input_size	Flops	Params	Size（M）	$Map^{@0.5}$	$Map^{@0.5:0.95}$
001	yolo-faster	320×320	0.25G	0.35M	1.4	24.4	-
002	nanodet-m	320×320	0.72G	0.95M	1.8	-	20.6
003	yolo-faster-xl	320×320	0.72G	0.92M	3.5	34.3	-
004	yolov5-lite	320×320	1.43G	1.62M	3.3	36.2	20.8
005	yolov3-tiny	416×416	6.96G	6.06M	23.0	33.1	16.6
006	yolov4-tiny	416×416	5.62G	8.86M	33.7	40.2	21.7
007	nanodet-m	416×416	1.2G	0.95M	1.8	-	23.5
008	yolov5-lite	416×416	2.42G	1.62M	3.3	41.3	24.4
009	yolov5-lite	640×640	2.42G	1.62M	3.3	45.7	27.1
010	yolov5s	640×640	17.0G	7.3M	14.2	55.4	36.7

notes ：yolov5 primary FLOPS The calculation script has bug, Please use thop Library calculation ：

input = torch.randn(1, 3, 416, 416)
flops, params = thop.profile(model, inputs=(input,))
print('flops:', flops / 900000000*2)
print('params:', params)

Two 、 Detection effect

$Pytorch^{@640×640}：$

$NCNN^{@FP16}_{640\times640}$

$NCNN^{@Int8}_{640\times640}$

3、 ... and 、Relect Work

YOLOv5-Lite The network structure of is actually very simple ,backbone Mainly used are shuffle channel Of shuffle block, The head is still used yolov5 head, But it's a castrated version yolov5 head

shuffle block：

yolov5 head：

yolov5 backbone：

In the beginning U Version of yolov5 backbone in , The author used four times in the superstructure of feature extraction slice Operations consist of Focus layer

about Focus layer , Every... In a square 4 Adjacent pixels , And generate a with 4 Times the number of channels feature map, It is similar to four down sampling operations on the upper layer , Then the result concat together , The main function is not to reduce the ability of feature extraction , Reduce and accelerate the model .

1.7.0+cu101 cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)

      Params       FLOPS    forward (ms)   backward (ms)                   input                  output
        7040       23.07           62.89           87.79       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07           15.52           48.69       (16, 3, 640, 640)      (16, 64, 320, 320)
1.7.0+cu101 cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)

      Params       FLOPS    forward (ms)   backward (ms)                   input                  output
        7040       23.07           11.61           79.72       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07           12.54           42.94       (16, 3, 640, 640)      (16, 64, 320, 320)

As can be seen from the above figure ,Focus When the layer parameters are reduced , The model is accelerated .

but ！ This acceleration is conditional , Must be in GPU This advantage can only be reflected under the use of , For cloud deployment, this method ,GPU It is not necessary to consider the cache occupation , The method of "take and process" makes Focus Layer in GPU The equipment is very work.

For your chip , Especially without GPU、NPU Accelerated chips , Frequent slice The operation will only make the cache seriously occupied , Increase the burden of calculation and processing . meanwhile , When the chip is deployed ,Focus Layer transformation is extremely unfriendly to novices .

Four 、 Lightweight concept

shufflenetv2 Design concept of , On the chip side where resources are scarce , It has many reference meanings , It proposes four criteria for model lightweight ：

（G1） The same channel size minimizes memory access
（G2） Overuse of group convolution increases MAC
（G3） The network is too fragmented （ Especially multi-channel ） Will reduce parallelism
（G4） Element level operations cannot be ignored （ such as shortcut and Add）

YOLOv5-Lite

Design concept ：
（G1） Enucleation Focus layer , Avoid multiple use of slice operation

（G2） Avoid multiple use C3 Leyer And high channel C3 Layer

C3 Leyer yes YOLOv5 Proposed by the author CSPBottleneck Improved version , It's simpler 、 faster 、 Lighter , Better results can be achieved at nearly similar losses . but C3 Layer Using demultiplexing convolution , Test certificate , Frequent use C3 Layer And those with a high number of channels C3 Layer, Take up more cache space , Reduce the running speed .

（ Why the higher the number of channels C3 Layer Would be right cpu Not very friendly , Mainly because shufflenetv2 Of G1 Rules , The higher the number of channels ,hidden channels And c1、c2 The step gap is larger , An inappropriate metaphor , Imagine jumping one step and ten steps , Although you can jump ten steps at a time , But you need a run-up , adjustment , Only by accumulating strength can you jump onto , It may take longer ）

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1) 
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

（G3） Yes yolov5 head Conduct channel pruning , Pruning details reference G1

（G4） Enucleation shufflenetv2 backbone Of 1024 conv and 5×5 pooling

This is for imagenet A module designed to make a list , When there are not so many classes in the actual business scenario , It can be removed properly , Accuracy will not have much effect , But it's a big boost for speed , This was also confirmed in ablation experiments .

5、 ... and 、What can be used for？

（G1） Training

Isn't that bullshit ... It's really a little nonsense ,YOLOv5-Lite be based on yolov5 The fifth edition （ The latest version ） Ablation experiments performed on , So you can continue all the functions of the fifth edition without modification , such as ：

Export heat map ：

Derive confusion matrix for data analysis ：

export PR curve ：

（G2） export onnx No other modifications are required after （ For deployment ）

（G3）DNN or ort The call no longer requires additional pairs Focus Layer by layer （ Before playing yolov5 Stuck here for a long time , Although it can be called, the accuracy also drops a lot ）：

（G4）ncnn Conduct int8 Quantification can ensure the continuity of accuracy （ In the next chapter, I will talk about ）

（G5） stay 0.1T Play with the raspberry pie of Suanli yolov5 Can also be real-time

I used to run on raspberry pie yolov5, It's something you can't even think of , It is necessary to detect only one frame 1000ms about , Even 160*120 You need to input 200ms about , I can't chew it .

But now YOLOv5-Lite Did it , The detection scene is in the space like elevator car and corridor corner , The actual detection distance only needs to be guaranteed 3m that will do , Adjust the resolution to 160*120 Under the circumstances ,YOLOv5-Lite Up to 18 frame , In addition, post-processing can basically stabilize in 15 Around the frame .

Remove the first three preheating , The temperature of the equipment is stable at 45° above , The forward reasoning framework is ncnn, Record twice benchmark contrast ：

#  The fourth time 
[email protected]pi:~/Downloads/ncnn/build/benchmark $ ./benchncnn 8 4 0
loop_count = 8
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
         YOLOv5-Lite  min =   90.86  max =   93.53  avg =   91.56
    YOLOv5-Lite-int8  min =   83.15  max =   84.17  avg =   83.65
     YOLOv5-Lite-416  min =  154.51  max =  155.59  avg =  155.09
         yolov4-tiny  min =  298.94  max =  302.47  avg =  300.69
           nanodet_m  min =   86.19  max =  142.79  avg =   99.61
          squeezenet  min =   59.89  max =   60.75  avg =   60.41
     squeezenet_int8  min =   50.26  max =   51.31  avg =   50.75
           mobilenet  min =   73.52  max =   74.75  avg =   74.05
      mobilenet_int8  min =   40.48  max =   40.73  avg =   40.63
        mobilenet_v2  min =   72.87  max =   73.95  avg =   73.31
        mobilenet_v3  min =   57.90  max =   58.74  avg =   58.34
          shufflenet  min =   40.67  max =   41.53  avg =   41.15
       shufflenet_v2  min =   30.52  max =   31.29  avg =   30.88
             mnasnet  min =   62.37  max =   62.76  avg =   62.56
     proxylessnasnet  min =   62.83  max =   64.70  avg =   63.90
     efficientnet_b0  min =   94.83  max =   95.86  avg =   95.35
   efficientnetv2_b0  min =  103.83  max =  105.30  avg =  104.74
        regnety_400m  min =   76.88  max =   78.28  avg =   77.46
           blazeface  min =   13.99  max =   21.03  avg =   15.37
           googlenet  min =  144.73  max =  145.86  avg =  145.19
      googlenet_int8  min =  123.08  max =  124.83  avg =  123.96
            resnet18  min =  181.74  max =  183.07  avg =  182.37
       resnet18_int8  min =  103.28  max =  105.02  avg =  104.17
             alexnet  min =  162.79  max =  164.04  avg =  163.29
               vgg16  min =  867.76  max =  911.79  avg =  889.88
          vgg16_int8  min =  466.74  max =  469.51  avg =  468.15
            resnet50  min =  333.28  max =  338.97  avg =  335.71
       resnet50_int8  min =  239.71  max =  243.73  avg =  242.54
      squeezenet_ssd  min =  179.55  max =  181.33  avg =  180.74
 squeezenet_ssd_int8  min =  131.71  max =  133.34  avg =  132.54
       mobilenet_ssd  min =  151.74  max =  152.67  avg =  152.32
  mobilenet_ssd_int8  min =   85.51  max =   86.19  avg =   85.77
      mobilenet_yolo  min =  327.67  max =  332.85  avg =  330.36
  mobilenetv2_yolov3  min =  221.17  max =  224.84  avg =  222.60

#  The eighth 
[email protected]:~/Downloads/ncnn/build/benchmark $ ./benchncnn 8 4 0
loop_count = 8
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
           nanodet_m  min =   81.15  max =   81.71  avg =   81.33
       nanodet_m-416  min =  143.89  max =  145.06  avg =  144.67
         YOLOv5-Lite  min =   84.30  max =   86.34  avg =   85.79
    YOLOv5-Lite-int8  min =   80.98  max =   82.80  avg =   81.25
     YOLOv5-Lite-416  min =  142.75  max =  146.10  avg =  144.34
         yolov4-tiny  min =  276.09  max =  289.83  avg =  285.99
          squeezenet  min =   59.37  max =   61.19  avg =   60.35
     squeezenet_int8  min =   49.30  max =   49.66  avg =   49.43
           mobilenet  min =   72.40  max =   74.13  avg =   73.37
      mobilenet_int8  min =   39.92  max =   40.23  avg =   40.07
        mobilenet_v2  min =   71.57  max =   73.07  avg =   72.29
        mobilenet_v3  min =   54.75  max =   56.00  avg =   55.40
          shufflenet  min =   40.07  max =   41.13  avg =   40.58
       shufflenet_v2  min =   29.39  max =   30.25  avg =   29.86
             mnasnet  min =   59.54  max =   60.18  avg =   59.96
     proxylessnasnet  min =   61.06  max =   62.63  avg =   61.75
     efficientnet_b0  min =   91.86  max =   95.01  avg =   92.84
   efficientnetv2_b0  min =  101.03  max =  102.61  avg =  101.71
        regnety_400m  min =   76.75  max =   78.58  avg =   77.60
           blazeface  min =   13.18  max =   14.67  avg =   13.79
           googlenet  min =  136.56  max =  138.05  avg =  137.14
      googlenet_int8  min =  118.30  max =  120.17  avg =  119.23
            resnet18  min =  164.78  max =  166.80  avg =  165.70
       resnet18_int8  min =   98.58  max =   99.23  avg =   98.96
             alexnet  min =  155.06  max =  156.28  avg =  155.56
               vgg16  min =  817.64  max =  832.21  avg =  827.37
          vgg16_int8  min =  457.04  max =  465.19  avg =  460.64
            resnet50  min =  318.57  max =  323.19  avg =  320.06
       resnet50_int8  min =  237.46  max =  238.73  avg =  238.06
      squeezenet_ssd  min =  171.61  max =  173.21  avg =  172.10
 squeezenet_ssd_int8  min =  128.01  max =  129.58  avg =  128.84
       mobilenet_ssd  min =  145.60  max =  149.44  avg =  147.39
  mobilenet_ssd_int8  min =   82.86  max =   83.59  avg =   83.22
      mobilenet_yolo  min =  311.95  max =  374.33  avg =  330.15
  mobilenetv2_yolov3  min =  211.89  max =  286.28  avg =  228.01

（G6）YOLOv5-Lite And yolov5s Comparison of

notes ： Randomly select 100 pictures for reasoning , Round off to calculate the average time per .

Afterword ：

I have used my own data set before yolov3-tiny,yolov4-tiny,nanodet,efficientnet-lite And other lightweight Networks , But the results did not meet expectations , Instead, use yolov5 It has achieved better results than I expected , But it's true ,yolov5 Not in the lightweight network design concept , So, right yolov5 Modified idea, Hope to be able to enhance its strong data and positive and negative anchor Satisfactory results can be achieved under the mechanism . in general ,YOLOv5-Lite Based on yolov5 Platform for training , For a small sample data set, it is still very work Of .

There is not much complicated interleaved parallel structure , Try to ensure the simplicity of the network model ,YOLOv5-Lite Designed purely for industrial landing , Better fit Arm Architecture's processor , But you use this thing to run GPU, Low cost performance .

Optimize this thing , Part of it is based on feelings , After all, a lot of early work is based on yolov5 Carried out by , Part of it is true that this thing is very important for my personal data set work（ To be exact , It should be for me who is extremely short of data set resources ,yolov5 The various mechanisms of are indeed robust to small sample data sets ）.

Project address ：
https://github.com/ppogg/YOLOv5-Lite

in addition , This project will be continuously updated and iterated , welcome star and fork！

Finally, a digression , In fact, I have been paying attention to YOLOv5 The dynamics of the , lately U The update frequency of version great God is much faster , It is estimated that soon YOLOv5 Will usher in the Sixth Edition ~

Insert picture description here