Yolov5 lite: how to speed up the Yolo model on CPU?
2022-07-08 02:19:00 【pogg_】
QQ Communication group :993965802
The copyright of this article belongs to GiantPandaCV; please do not reprint without permission.
**Preface:** This is an experiment write-up. A few days ago, Baidu released PPLcnet, a network designed specifically for CPU. After reading the paper, I decided to reproduce it as PPLcnet-yolo: first, to verify the network's performance on CPU; second, if the results work out, this set of experiments can be merged into my own repository.
I. PPLcnet performance:
While reading the paper, what tempted me most was the following benchmark comparison.
Actually, before this I had experimented with mobilenetv2 and mobilenetv3, but the results were not what I had hoped for. The reason is simple: on the ARM architecture, the mb series falls short of the shuffle series, and accuracy does not make up for it; in fact, the accuracy difference between them is no more than 3 points. For edge deployment, speed and memory usage are the two most critical factors (assuming accuracy stays within an acceptable range), so I did not hesitate to use shufflenetv2 as the backbone.
Of course, you can't just graft things in blindly; you need to weigh the pros and cons before choosing. For example, with the yolov5s head, if you graft it on directly, some channels become redundant, which shows up not only in the model parameters but in several other ways as well.
Using model pruning to approximate the maximum channel capacity, then running experiments to verify the effect, is a semi-brute-force solution, but it saves a lot of otherwise wasted time.
On the other hand, the two branches of the shufflenetv2-yolov5 model use many BN layers. Fusing them at deployment time gives about a 15% speedup (this code will be merged after my thesis defense).
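The BN fusion mentioned above can be sketched as a minimal inference-time fold of a BatchNorm layer into its preceding convolution; the function name and shapes below are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding conv (inference-time only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, groups=conv.groups, bias=True)
    with torch.no_grad():
        # BN in eval mode is y = (x - mean) / sqrt(var + eps) * gamma + beta,
        # a per-channel affine transform that the conv weights can absorb.
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(8, 16, 3, padding=1, bias=False), nn.BatchNorm2d(16)
bn.eval()  # use running statistics, as at deployment time
x = torch.randn(1, 8, 12, 12)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```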
On the GPU architecture, Repvgg-yolov5 follows the same idea: the head becomes thicker and narrower, mainly to cut parameters and computation (thanks to the C3 structure), while the backbone is replaced with repvgg. Multi-branch feature extraction is used during training, and at deployment the branches are re-parameterized into a plain single-path network, giving about a 20% speedup. Parameters and computation drop by 35% and 10% respectively. On accuracy, Repvgg-yolov5 improves mAP@0.5 by 1.1 and mAP@0.5:0.95 by 2.2, at the cost of forward inference being about 1 ms slower than the original yolov5s (tested on a 2080Ti).
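The re-parameterization idea can be illustrated with a toy sketch: fold a parallel 1×1 branch into a single 3×3 convolution by zero-padding its kernel. This is a minimal example of the principle, not the actual Repvgg-yolov5 code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Training-time two-branch block: a 3x3 conv plus a parallel 1x1 conv.
c = 8
conv3 = nn.Conv2d(c, c, 3, padding=1, bias=False)  # 3x3 branch
conv1 = nn.Conv2d(c, c, 1, bias=False)             # 1x1 branch

# Deployment-time single-path conv that absorbs both branches.
fused = nn.Conv2d(c, c, 3, padding=1, bias=False)
with torch.no_grad():
    # zero-pad the 1x1 kernel to 3x3, then add the two kernels together
    fused.weight.copy_(conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1]))

x = torch.randn(1, c, 16, 16)
# the fused conv reproduces the two-branch output exactly
assert torch.allclose(conv3(x) + conv1(x), fused(x), atol=1e-5)
```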
To sum up: call me a debug maniac if you like; there is nothing novel here, but these are all practical models for industrial deployment.
On the CPU architecture, I had run the mbv2 and mbv3 experiments before; their accuracy is in fact close to shufflev2's. But compared with yolov5s at input size 352×352, yolov5s is slightly more accurate than the modified models, which also show no great advantage in speed.
Then PPLcnet appeared, and I had a strong urge to try whether this network could help yolo speed up on CPU.
The structure of the model is roughly as follows:
The most important component is the depthwise-separable convolution. Starting from the first CBH layer (conv + bn + hardswish), there are 13 DW layers, followed by GAP (a 7×7 Global Average Pooling), then a point conv + FC + hardswish component, and finally a 1000-way FC output layer. For more details, see the paper:
https://arxiv.org/pdf/2109.15099.pdf
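As a rough illustration of the blocks named above, here is a minimal sketch of a CBH layer and a depthwise-separable (DW) block; the class names and hyperparameters are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class CBH(nn.Module):
    """Conv + BatchNorm + Hardswish (illustrative sketch)."""
    def __init__(self, c1, c2, k=3, s=2):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.Hardswish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DWBlock(nn.Module):
    """Depthwise-separable conv: depthwise (groups=c1) then 1x1 pointwise."""
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        self.dw = nn.Conv2d(c1, c1, k, s, k // 2, groups=c1, bias=False)
        self.bn1 = nn.BatchNorm2d(c1)
        self.pw = nn.Conv2d(c1, c2, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c2)
        self.act = nn.Hardswish()

    def forward(self, x):
        x = self.act(self.bn1(self.dw(x)))
        return self.act(self.bn2(self.pw(x)))

x = torch.randn(1, 3, 64, 64)
y = DWBlock(16, 32)(CBH(3, 16)(x))  # stride-2 CBH halves the resolution
assert y.shape == (1, 32, 32, 32)
```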
The paper can be summarized into four important conclusions about PPLcnet:
- H-Swish and large convolution kernels can improve model performance without causing a large inference penalty (see Table 9 in the paper);
- Adding a small number of SE modules can further improve performance without excessive cost (Lcnet in fact only adds attention to the last two layers, yet the improvement is obvious);
- A larger FC layer after GAP can greatly improve model performance (but it also makes the parameters and computation soar);
- dropout can further improve the model's accuracy.
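The SE-module conclusion above can be illustrated with a minimal squeeze-and-excitation block gated by a hard-sigmoid; the class name and reduction ratio here are illustrative:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze (GAP) -> excite (two 1x1 convs) -> gate."""
    def __init__(self, c, r=4):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(c, c // r, 1)   # reduce channels by ratio r
        self.fc2 = nn.Conv2d(c // r, c, 1)   # restore channel count
        self.act = nn.ReLU()
        self.gate = nn.Hardsigmoid()         # cheap sigmoid approximation

    def forward(self, x):
        w = self.gate(self.fc2(self.act(self.fc1(self.avg(x)))))
        return x * w  # reweight each channel

x = torch.randn(1, 16, 8, 8)
assert SEBlock(16)(x).shape == (1, 16, 8, 8)
```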
II. PPLcnet-yolo:
The figure below shows YOLOv5 fused with PPLcnet. Unlike the original Lcnet, the number of layers here has changed; beyond that, the 3×3 convolutions in the YOLOv5s head are replaced with Lc_Block, and SE modules are used. Let's go through it layer by layer:
1. The number of layers changes
As shown above, the CBH channels are doubled, two DSC 3×3 convolution layers with 256 channels are removed and replaced with two DSC 5×5 layers (without SE modules), and the last four DSC layers contain SE modules. The total number of layers grows by only 3, while the SE modules go from the original 2 layers to 4 (the last 4 layers), yet accuracy improves substantially. This follows shufflev2's [2, 4, 8, 4] even-multiple layer arrangement; if you are interested, that paper is well worth reading for its engineering value.
2. Dense Layer
The Dense Layer is essentially GAP + FC. Experiments showed that adding the FC layer improves accuracy by about 4 points, but it makes the model parameters soar and slows down inference, so I removed all the FC layers and kept only the point conv and dropout:
import torch
import torch.nn as nn

class Dense(nn.Module):
    def __init__(self, c1, c2, filter_size, dropout_prob=0.2):
        super().__init__()
        self.c2 = c2
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.dense_conv = nn.Conv2d(
            in_channels=c1,
            out_channels=c2,
            kernel_size=filter_size,
            stride=1,
            padding=0,
            bias=False)
        self.hardswish = nn.Hardswish()
        self.dropout = nn.Dropout(p=dropout_prob)
        self.flatten = nn.Flatten(start_dim=1, end_dim=-1)
        self.fc = nn.Linear(c2, c2)

    def forward(self, x):
        x = self.avg_pool(x)
        b, _, w, h = x.shape
        x = self.dense_conv(x)
        x = self.hardswish(x)
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.fc(x)
        x = x.reshape(b, self.c2, w, h)
        return x
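For intuition, the same GAP + point-conv idea (with the heavy FC layers dropped) can be written as a self-contained sketch; the channel sizes (512 → 256) are arbitrary examples:

```python
import torch
import torch.nn as nn

# Minimal sketch of the idea described above: GAP, then a 1x1 point conv
# in place of a large FC layer, followed by hardswish and dropout.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),             # GAP: (B, 512, H, W) -> (B, 512, 1, 1)
    nn.Conv2d(512, 256, 1, bias=False),  # point conv replaces the heavy FC
    nn.Hardswish(),
    nn.Dropout(0.2),
)
x = torch.randn(2, 512, 20, 20)
assert head(x).shape == (2, 256, 1, 1)
```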
3. head
PPLcnet has verified that replacing a small number of convolutions at the end of the network with 5×5 ones can raise accuracy, so the 3×3 convolutions in the original yolov5s head are replaced with Lc_Block. Since Lc_Block is essentially a depthwise-separable convolution, even with a 5×5 kernel and an integrated SE module its parameter count is still less than half that of the original 3×3 convolution. Experiments showed it does raise accuracy while adding very few parameters; personally I think it's very cost-effective:
# YOLOv5s head:
Model Summary: 297 layers, 4982390 parameters, 4982390 gradients, 9.4 GFLOPS
# YOLOv5s head with Lc_Block:
Model Summary: 307 layers, 4376531 parameters, 4376531 gradients, 8.6 GFLOPS
# YOLOv5s head with Lc_Block and SE Module:
Model Summary: 319 layers, 4378838 parameters, 4378838 gradients, 8.6 GFLOPS
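For reference, here is a minimal sketch of what an Lc_Block-style head convolution might look like: a 5×5 depthwise-separable convolution with an optional SE gate. This is my illustrative reconstruction, not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class LcBlock(nn.Module):
    """Illustrative 5x5 depthwise-separable conv with an optional SE gate."""
    def __init__(self, c1, c2, k=5, use_se=True):
        super().__init__()
        # depthwise conv (groups=c1) with a 5x5 kernel
        self.dw = nn.Sequential(
            nn.Conv2d(c1, c1, k, 1, k // 2, groups=c1, bias=False),
            nn.BatchNorm2d(c1), nn.Hardswish())
        # optional squeeze-and-excitation gate on the depthwise output
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c1, c1 // 4, 1), nn.ReLU(),
            nn.Conv2d(c1 // 4, c1, 1), nn.Hardsigmoid()) if use_se else None
        # pointwise 1x1 conv to mix channels
        self.pw = nn.Sequential(
            nn.Conv2d(c1, c2, 1, bias=False),
            nn.BatchNorm2d(c2), nn.Hardswish())

    def forward(self, x):
        x = self.dw(x)
        if self.se is not None:
            x = x * self.se(x)
        return self.pw(x)

x = torch.randn(1, 128, 40, 40)
assert LcBlock(128, 128)(x).shape == (1, 128, 40, 40)
```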
There are also some small component changes. For example, the hard-sigmoid in the SE module is replaced with SiLU, which gains a little accuracy and some speed (here I follow what the v5 author did). Another reason is that onnx has no h-sigmoid operator, so you would need to re-implement it (that reconstruction causes a slight drop in accuracy, so replacing the activation function is the least troublesome option).
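If you do need an onnx-friendly h-sigmoid, it can be rebuilt from ReLU6, which exports cleanly; a minimal sketch (the class name is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSigmoid(nn.Module):
    """ONNX-friendly hard-sigmoid expressed via ReLU6: relu6(x + 3) / 6."""
    def forward(self, x):
        return F.relu6(x + 3.0) / 6.0

x = torch.randn(16)
# matches PyTorch's built-in hardsigmoid, which uses the same formula
assert torch.allclose(HSigmoid()(x), F.hardsigmoid(x), atol=1e-6)
```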
4. performance
The performance of the reproduced model is as follows:
On both mAP@0.5 and mAP@0.5:0.95 it is about three points below the original yolov5s, while the parameters and computation are roughly halved.
However, none of the above is the main point. What matters most, I think, is runtime performance, so I evaluated PPLcnet and yolov5s in openvino; the test hardware is an Intel Core i5-10210.
First export the onnx model:
$ python models/export.py --weights PPLcnet.pt --img 640 --batch 1
$ python -m onnxsim PPLcnet.onnx PPLcnet-sim.onnx
Then convert PPLcnet-sim.onnx into an IR model:
$ python mo.py --input_model PPLcnet-sim.onnx -s 255 --data_type FP16 --reverse_input_channels --output Conv_462,Conv_478,Conv_494
Similarly for yolov5s:
$ python models/export.py --weights yolov5s.pt --img 640 --batch 1
$ python -m onnxsim yolov5s.onnx yolov5s-sim.onnx
$ python mo.py --input_model yolov5s-sim.onnx -s 255 --data_type FP16 --reverse_input_channels --output Conv_245,Conv_261,Conv_277
At this point, we have four models:
Model comparison :
Then the test: 50 images in total, a for loop of 1000 forward passes, computing the average time per image:
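The timing procedure can be sketched as a simple warm-up-then-average loop; `infer` below is a placeholder standing in for one OpenVINO forward pass on a preprocessed image:

```python
import time

def benchmark(infer, n_warmup=10, n_runs=1000):
    """Average latency in ms of `infer()` over n_runs calls (after warm-up)."""
    for _ in range(n_warmup):   # warm-up: caches, JIT, memory allocation
        infer()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer()
    return (time.perf_counter() - t0) / n_runs * 1000.0

# `infer` here is a dummy workload; replace it with the model's forward pass.
avg_ms = benchmark(lambda: sum(range(1000)))
assert avg_ms >= 0.0
```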
The tests show that at input size 640×640, PPLcnet's forward inference is roughly 3× faster than the original yolov5s. Some sample visualizations:
PPLcnet-yolo Forward Example:
YOLOv5s Forward Example:
Final remarks:
The reproduced experiments and code will be merged into the main branch later:
https://github.com/ppogg/YOLOv5-Lite
Feel free to use it for free; if you have any questions, open an issue and I'll resolve it as soon as possible.
Also, this model is designed for CPU, so please use openvino or another CPU-oriented inference framework for deployment and evaluation!!!
YOLOv5 version 6.0 is here
Here we go: yesterday I saw that YOLOv5 released its sixth version:
The model's performance has improved:
Still nothing novel, but there is a breakthrough in engineering value, reflected in compute cost and inference time.
Also, I think there are three main highlights: YOLOv5-Nano's adaptation to mobile devices, the Focus layer changes, and the SPP changes:
1. YOLOv5-Nano performance:
Previously, when deploying a quantized yolov5s model with the focus layer, it was easy to see it fall apart; for small models, the v5 author simply replaced it. Perhaps this is for stability: conv 3×3 optimization is very mature across frameworks, and for most small models the parameters and runtime computation are not large anyway, so focus can hardly reduce them further, while the replacement makes quantization more stable.
2. Focus Layer changes :
3. SPP→SPPF:
By the way, the v4 author, the v5 author, and the Scaled YOLOv4 author are all known in the community as commit maniacs; for a while I saw them push updates every day. That craftsmanship is truly admirable.