Yolov5 lite: how to speed up the Yolo model on CPU?
2022-07-08 02:19:00 【pogg_】
QQ Communication group :993965802
The copyright of this article belongs to GiantPandaCV; please do not reprint without permission.
**Preface:** This is an experiment write-up. A few days ago, Baidu released PPLcnet, a network designed specifically for CPU. After reading the paper, I decided to reproduce it as PPLcnet-yolo: first, to verify the network's performance on CPU; second, if the results work out, this set of experiments can be merged into my own repository.
I. PPLcnet performance:
While reading the paper, what tempted me most was the following benchmark comparison.
Actually, before this I had experimented with mobilenetv2 and mobilenetv3, but the results were not what I had hoped for. The reason is simple: on the ARM architecture, the mb series falls short of the shuffle series, and accuracy does not make up for it; in fact, the accuracy difference between them is no more than 3 points. For edge deployment, speed and memory usage are the two most critical factors (assuming accuracy stays within an acceptable range), so I did not hesitate to use shufflenetv2 as the backbone.
Of course, you can't just graft things in blindly; you need to weigh the pros and cons before choosing. For example, with the yolov5s head, if you graft it on directly, some channels become redundant, which shows up not only in the model parameters but in several other ways as well.
Using model pruning to approximate the maximum channel capacity, then running experiments to verify the effect, is a semi-brute-force solution, but it saves a lot of otherwise wasted time.
On the other hand, the two branches of the shufflenetv2-yolov5 model use many BN layers. Fusing them at deployment time gives about a 15% speedup (this code will be merged after my thesis defense).
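The BN fusion mentioned above can be sketched as a minimal inference-time fold of a BatchNorm layer into its preceding convolution; the function name and shapes below are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding conv (inference-time only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, groups=conv.groups, bias=True)
    with torch.no_grad():
        # BN in eval mode is y = (x - mean) / sqrt(var + eps) * gamma + beta,
        # a per-channel affine transform that the conv weights can absorb.
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(8, 16, 3, padding=1, bias=False), nn.BatchNorm2d(16)
bn.eval()  # use running statistics, as at deployment time
x = torch.randn(1, 8, 12, 12)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```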
On the GPU architecture, Repvgg-yolov5 follows the same idea: the head becomes thicker and narrower, mainly to cut parameters and computation (thanks to the C3 structure), while the backbone is replaced with repvgg. Multi-branch feature extraction is used during training, and at deployment the branches are re-parameterized into a plain single-path network, giving about a 20% speedup. Parameters and computation drop by 35% and 10% respectively. On accuracy, Repvgg-yolov5 improves mAP@0.5 by 1.1 and mAP@0.5:0.95 by 2.2, at the cost of forward inference being about 1 ms slower than the original yolov5s (tested on a 2080Ti).
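The re-parameterization idea can be illustrated with a toy sketch: fold a parallel 1×1 branch into a single 3×3 convolution by zero-padding its kernel. This is a minimal example of the principle, not the actual Repvgg-yolov5 code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Training-time two-branch block: a 3x3 conv plus a parallel 1x1 conv.
c = 8
conv3 = nn.Conv2d(c, c, 3, padding=1, bias=False)  # 3x3 branch
conv1 = nn.Conv2d(c, c, 1, bias=False)             # 1x1 branch

# Deployment-time single-path conv that absorbs both branches.
fused = nn.Conv2d(c, c, 3, padding=1, bias=False)
with torch.no_grad():
    # zero-pad the 1x1 kernel to 3x3, then add the two kernels together
    fused.weight.copy_(conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1]))

x = torch.randn(1, c, 16, 16)
# the fused conv reproduces the two-branch output exactly
assert torch.allclose(conv3(x) + conv1(x), fused(x), atol=1e-5)
```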
To sum up: call me a debug maniac if you like; there is nothing novel here, but these are all practical models for industrial deployment.
On the CPU architecture, I had run the mbv2 and mbv3 experiments before; their accuracy is in fact close to shufflev2's. But compared with yolov5s at input size 352×352, yolov5s is slightly more accurate than the modified models, which also show no great advantage in speed.
Then PPLcnet appeared, and I had a strong urge to try whether this network could help yolo speed up on CPU.
The structure of the model is roughly as follows:
The most important component is the depthwise-separable convolution. Starting from the first CBH layer (conv + bn + hardswish), there are 13 DW layers, followed by GAP (a 7×7 Global Average Pooling), then a point conv + FC + hardswish component, and finally a 1000-way FC output layer. For more details, see the paper:
https://arxiv.org/pdf/2109.15099.pdf
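As a rough illustration of the blocks named above, here is a minimal sketch of a CBH layer and a depthwise-separable (DW) block; the class names and hyperparameters are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class CBH(nn.Module):
    """Conv + BatchNorm + Hardswish (illustrative sketch)."""
    def __init__(self, c1, c2, k=3, s=2):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.Hardswish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DWBlock(nn.Module):
    """Depthwise-separable conv: depthwise (groups=c1) then 1x1 pointwise."""
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        self.dw = nn.Conv2d(c1, c1, k, s, k // 2, groups=c1, bias=False)
        self.bn1 = nn.BatchNorm2d(c1)
        self.pw = nn.Conv2d(c1, c2, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c2)
        self.act = nn.Hardswish()

    def forward(self, x):
        x = self.act(self.bn1(self.dw(x)))
        return self.act(self.bn2(self.pw(x)))

x = torch.randn(1, 3, 64, 64)
y = DWBlock(16, 32)(CBH(3, 16)(x))  # stride-2 CBH halves the resolution
assert y.shape == (1, 32, 32, 32)
```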
The paper can be summarized into four important conclusions about PPLcnet:
- H-Swish and large convolution kernels can improve model performance without causing a large inference penalty (see Table 9 in the paper);
- Adding a small number of SE modules can further improve performance without excessive cost (Lcnet in fact only adds attention to the last two layers, yet the improvement is obvious);
- A larger FC layer after GAP can greatly improve model performance (but it also makes the parameters and computation soar);
- dropout can further improve the model's accuracy.
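The SE-module conclusion above can be illustrated with a minimal squeeze-and-excitation block gated by a hard-sigmoid; the class name and reduction ratio here are illustrative:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze (GAP) -> excite (two 1x1 convs) -> gate."""
    def __init__(self, c, r=4):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(c, c // r, 1)   # reduce channels by ratio r
        self.fc2 = nn.Conv2d(c // r, c, 1)   # restore channel count
        self.act = nn.ReLU()
        self.gate = nn.Hardsigmoid()         # cheap sigmoid approximation

    def forward(self, x):
        w = self.gate(self.fc2(self.act(self.fc1(self.avg(x)))))
        return x * w  # reweight each channel

x = torch.randn(1, 16, 8, 8)
assert SEBlock(16)(x).shape == (1, 16, 8, 8)
```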
II. PPLcnet-yolo:
The figure below shows YOLOv5 fused with PPLcnet. Unlike the original Lcnet, the number of layers here has changed; beyond that, the 3×3 convolutions in the YOLOv5s head are replaced with Lc_Block, and SE modules are used. Let's go through it layer by layer:
1. The number of layers changes
As shown above, the CBH channels are doubled, two DSC 3×3 convolution layers with 256 channels are removed and replaced with two DSC 5×5 layers (without SE modules), and the last four DSC layers contain SE modules. The total number of layers grows by only 3, while the SE modules go from the original 2 layers to 4 (the last 4 layers), yet accuracy improves substantially. This follows shufflev2's [2, 4, 8, 4] even-multiple layer arrangement; if you are interested, that paper is well worth reading for its engineering value.
2. Dense Layer
The Dense Layer is essentially GAP + FC. Experiments showed that adding the FC layer improves accuracy by about 4 points, but it makes the model parameters soar and slows down inference, so I removed all the FC layers and kept only the point conv and dropout:
import torch
import torch.nn as nn

class Dense(nn.Module):
    def __init__(self, c1, c2, filter_size, dropout_prob=0.2):
        super().__init__()
        self.c2 = c2
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.dense_conv = nn.Conv2d(
            in_channels=c1,
            out_channels=c2,
            kernel_size=filter_size,
            stride=1,
            padding=0,
            bias=False)
        self.hardswish = nn.Hardswish()
        self.dropout = nn.Dropout(p=dropout_prob)
        self.flatten = nn.Flatten(start_dim=1, end_dim=-1)
        self.fc = nn.Linear(c2, c2)

    def forward(self, x):
        x = self.avg_pool(x)
        b, _, w, h = x.shape
        x = self.dense_conv(x)
        x = self.hardswish(x)
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.fc(x)
        x = x.reshape(b, self.c2, w, h)
        return x
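For intuition, the same GAP + point-conv idea (with the heavy FC layers dropped) can be written as a self-contained sketch; the channel sizes (512 → 256) are arbitrary examples:

```python
import torch
import torch.nn as nn

# Minimal sketch of the idea described above: GAP, then a 1x1 point conv
# in place of a large FC layer, followed by hardswish and dropout.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),             # GAP: (B, 512, H, W) -> (B, 512, 1, 1)
    nn.Conv2d(512, 256, 1, bias=False),  # point conv replaces the heavy FC
    nn.Hardswish(),
    nn.Dropout(0.2),
)
x = torch.randn(2, 512, 20, 20)
assert head(x).shape == (2, 256, 1, 1)
```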
3. head
PPLcnet has verified that replacing a small number of convolutions at the end of the network with 5×5 ones can raise accuracy, so the 3×3 convolutions in the original yolov5s head are replaced with Lc_Block. Since Lc_Block is essentially a depthwise-separable convolution, even with a 5×5 kernel and an integrated SE module its parameter count is still less than half that of the original 3×3 convolution. Experiments showed it does raise accuracy while adding very few parameters; personally I think it's very cost-effective:
# YOLOv5s head:
Model Summary: 297 layers, 4982390 parameters, 4982390 gradients, 9.4 GFLOPS
# YOLOv5s head with Lc_Block:
Model Summary: 307 layers, 4376531 parameters, 4376531 gradients, 8.6 GFLOPS
# YOLOv5s head with Lc_Block and SE Module:
Model Summary: 319 layers, 4378838 parameters, 4378838 gradients, 8.6 GFLOPS
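For reference, here is a minimal sketch of what an Lc_Block-style head convolution might look like: a 5×5 depthwise-separable convolution with an optional SE gate. This is my illustrative reconstruction, not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class LcBlock(nn.Module):
    """Illustrative 5x5 depthwise-separable conv with an optional SE gate."""
    def __init__(self, c1, c2, k=5, use_se=True):
        super().__init__()
        # depthwise conv (groups=c1) with a 5x5 kernel
        self.dw = nn.Sequential(
            nn.Conv2d(c1, c1, k, 1, k // 2, groups=c1, bias=False),
            nn.BatchNorm2d(c1), nn.Hardswish())
        # optional squeeze-and-excitation gate on the depthwise output
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c1, c1 // 4, 1), nn.ReLU(),
            nn.Conv2d(c1 // 4, c1, 1), nn.Hardsigmoid()) if use_se else None
        # pointwise 1x1 conv to mix channels
        self.pw = nn.Sequential(
            nn.Conv2d(c1, c2, 1, bias=False),
            nn.BatchNorm2d(c2), nn.Hardswish())

    def forward(self, x):
        x = self.dw(x)
        if self.se is not None:
            x = x * self.se(x)
        return self.pw(x)

x = torch.randn(1, 128, 40, 40)
assert LcBlock(128, 128)(x).shape == (1, 128, 40, 40)
```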
There are also some small component changes. For example, the hard-sigmoid in the SE module is replaced with SiLU, which gains a little accuracy and some speed (here I follow what the v5 author did). Another reason is that onnx has no h-sigmoid operator, so you would need to re-implement it (that reconstruction causes a slight drop in accuracy, so replacing the activation function is the least troublesome option).
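If you do need an onnx-friendly h-sigmoid, it can be rebuilt from ReLU6, which exports cleanly; a minimal sketch (the class name is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSigmoid(nn.Module):
    """ONNX-friendly hard-sigmoid expressed via ReLU6: relu6(x + 3) / 6."""
    def forward(self, x):
        return F.relu6(x + 3.0) / 6.0

x = torch.randn(16)
# matches PyTorch's built-in hardsigmoid, which uses the same formula
assert torch.allclose(HSigmoid()(x), F.hardsigmoid(x), atol=1e-6)
```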
4. performance
The performance of the reproduced model is as follows:
On both mAP@0.5 and mAP@0.5:0.95 it is about three points below the original yolov5s, while the parameters and computation are roughly halved.
However, none of the above is the main point. What matters most, I think, is runtime performance, so I evaluated PPLcnet and yolov5s in openvino; the test hardware is an Intel Core i5-10210.
First export the onnx model:
$ python models/export.py --weights PPLcnet.pt --img 640 --batch 1
$ python -m onnxsim PPLcnet.onnx PPLcnet-sim.onnx
Then convert PPLcnet-sim.onnx into an IR model:
$ python mo.py --input_model PPLcnet-sim.onnx -s 255 --data_type FP16 --reverse_input_channels --output Conv_462,Conv_478,Conv_494
Similarly for yolov5s:
$ python models/export.py --weights yolov5s.pt --img 640 --batch 1
$ python -m onnxsim yolov5s.onnx yolov5s-sim.onnx
$ python mo.py --input_model yolov5s-sim.onnx -s 255 --data_type FP16 --reverse_input_channels --output Conv_245,Conv_261,Conv_277
At this point, we have four models:
Model comparison :
Then the test: 50 images in total, a for loop of 1000 forward passes, computing the average time per image:
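The timing procedure can be sketched as a simple warm-up-then-average loop; `infer` below is a placeholder standing in for one OpenVINO forward pass on a preprocessed image:

```python
import time

def benchmark(infer, n_warmup=10, n_runs=1000):
    """Average latency in ms of `infer()` over n_runs calls (after warm-up)."""
    for _ in range(n_warmup):   # warm-up: caches, JIT, memory allocation
        infer()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer()
    return (time.perf_counter() - t0) / n_runs * 1000.0

# `infer` here is a dummy workload; replace it with the model's forward pass.
avg_ms = benchmark(lambda: sum(range(1000)))
assert avg_ms >= 0.0
```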
The tests show that at input size 640×640, PPLcnet's forward inference is roughly 3× faster than the original yolov5s. Some sample visualizations:
PPLcnet-yolo Forward Example:
YOLOv5s Forward Example:
Final remarks:
The reproduced experiments and code will be merged into the main branch later:
https://github.com/ppogg/YOLOv5-Lite
Feel free to use it for free; if you have any questions, open an issue and I'll resolve it as soon as possible.
Also, this model is designed for CPU, so please use openvino or another CPU-oriented inference framework for deployment and evaluation!!!
YOLOv5 version 6.0 is here
Here we go: yesterday I saw that YOLOv5 released its sixth version:
The model's performance has improved:
Still nothing novel, but there is a breakthrough in engineering value, reflected in compute cost and inference time.
Also, I think there are three main highlights: YOLOv5-Nano's adaptation to mobile devices, the Focus layer changes, and the SPP changes:
1. YOLOv5-Nano performance:
Previously, when deploying a quantized yolov5s model with the focus layer, it was easy to see it fall apart; for small models, the v5 author simply replaced it. Perhaps this is for stability: conv 3×3 optimization is very mature across frameworks, and for most small models the parameters and runtime computation are not large anyway, so focus can hardly reduce them further, while the replacement makes quantization more stable.
2. Focus Layer changes :
3. SPP→SPPF:
By the way, the v4 author, the v5 author, and the Scaled YOLOv4 author are all known in the community as commit maniacs; for a while I saw them push updates every day. That craftsmanship is truly admirable.