Engineering deployment (III): optimization of low computing power platform model performance
2022-07-08 02:20:00 【pogg_】

Preface: This article discusses how to improve model performance on low-end mobile devices. It covers optimization of the model itself (without changing the original model's ops and without retraining) and optimization of post-processing. If anything here is wrong, corrections are welcome!
1. Model optimization
1.1 Op fusion
Model optimization here means fusing the model's convolution layers with their bn layers, or fusing parameter-equivalent operations such as connection and identity branches. The idea came from a discussion I happened to join one day:
The expert in that discussion thought fusing can be done but is not necessary: fuse(conv+bn)=CB serves other purposes and contributes little to speed. I held to my own view, though, because the yolov5 comparison was made on high-compute graphics cards. On low-end cards, and even more on devices without GPU or NPU acceleration, fusion gives an obvious speed-up.
This is especially true for models that make heavy use of group conv or depthwise conv. For example, shufflenetv2, regarded as an efficient mobile network, is often used as the backbone on the device side, and a single shuffle block (stride=2) uses two depthwise separable convolutions:
The whole network uses 25 depthwise convs in total (the shufflenet series was designed for low-compute CPU devices, so heavy reuse of depthwise separable convolutions is unavoidable).
With that in mind, I ran an experiment on the v5lite-s model and post the test results here for discussion:
The test results above were obtained after fusing every convolution and bn layer in the shuffle blocks, using 1,000 images sampled from coco val2017. On an i5 core, the fused model shows an obvious speed-up of a single forward pass on an x86 CPU compared with the unfused model. On an ARM CPU the effect should be even more obvious.
The fusion script is as follows:
import torch
from thop import profile
from copy import deepcopy
from models.experimental import attempt_load


def model_print(model, img_size):
    # Model information. img_size may be int or list, i.e. img_size=640 or img_size=[640, 320]
    n_p = sum(x.numel() for x in model.parameters())  # number parameters
    n_g = sum(x.numel() for x in model.parameters() if x.requires_grad)  # number gradients
    stride = max(int(model.stride.max()), 32) if hasattr(model, 'stride') else 32
    img = torch.zeros((1, model.yaml.get('ch', 3), stride, stride), device=next(model.parameters()).device)  # input
    flops = profile(deepcopy(model), inputs=(img,), verbose=False)[0] / 1E9 * 2  # stride GFLOPS
    img_size = img_size if isinstance(img_size, list) else [img_size, img_size]  # expand if int/float
    fs = ', %.6f GFLOPS' % (flops * img_size[0] / stride * img_size[1] / stride)  # imh x imw GFLOPS
    print(f"Model Summary: {len(list(model.modules()))} layers, {n_p} parameters, {n_g} gradients{fs}")


if __name__ == '__main__':
    load = 'weights/v5lite-e.pt'
    save = 'weights/repv5lite-e.pt'
    test_size = 320
    print(f'Done. Befrom weights:({load})')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = attempt_load(load, map_location=device)  # load FP32 model
    torch.save(model, save)
    model_print(model, test_size)
    print(model)
The core code for op fusion is as follows:
# Fragment: assumed to run inside the model's fuse() method, with m iterating
# over the model's modules (the loop header is not part of the original listing).
if type(m) is Shuffle_Block:
    if hasattr(m, 'branch1'):
        # Rebuild branch1 without its bn layers: conv, conv, relu
        re_branch1 = nn.Sequential(
            nn.Conv2d(m.branch1[0].in_channels, m.branch1[0].out_channels,
                      kernel_size=m.branch1[0].kernel_size, stride=m.branch1[0].stride,
                      padding=m.branch1[0].padding, groups=m.branch1[0].groups),
            nn.Conv2d(m.branch1[2].in_channels, m.branch1[2].out_channels,
                      kernel_size=m.branch1[2].kernel_size, stride=m.branch1[2].stride,
                      padding=m.branch1[2].padding, bias=False),
            nn.ReLU(inplace=True),
        )
        # Replace each placeholder conv with the fused conv+bn
        re_branch1[0] = fuse_conv_and_bn(m.branch1[0], m.branch1[1])
        re_branch1[1] = fuse_conv_and_bn(m.branch1[2], m.branch1[3])
        # pdb.set_trace()
        # print(m.branch1[0])
        m.branch1 = re_branch1
    if hasattr(m, 'branch2'):
        # Rebuild branch2 without its bn layers: conv, relu, conv, conv, relu
        re_branch2 = nn.Sequential(
            nn.Conv2d(m.branch2[0].in_channels, m.branch2[0].out_channels,
                      kernel_size=m.branch2[0].kernel_size, stride=m.branch2[0].stride,
                      padding=m.branch2[0].padding, groups=m.branch2[0].groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(m.branch2[3].in_channels, m.branch2[3].out_channels,
                      kernel_size=m.branch2[3].kernel_size, stride=m.branch2[3].stride,
                      padding=m.branch2[3].padding, bias=False),
            nn.Conv2d(m.branch2[5].in_channels, m.branch2[5].out_channels,
                      kernel_size=m.branch2[5].kernel_size, stride=m.branch2[5].stride,
                      padding=m.branch2[5].padding, groups=m.branch2[5].groups),
            nn.ReLU(inplace=True),
        )
        re_branch2[0] = fuse_conv_and_bn(m.branch2[0], m.branch2[1])
        re_branch2[2] = fuse_conv_and_bn(m.branch2[3], m.branch2[4])
        re_branch2[3] = fuse_conv_and_bn(m.branch2[5], m.branch2[6])
        # pdb.set_trace()
        m.branch2 = re_branch2
        # print(m.branch2)
# After all modules have been processed:
self.info()
The figure below shows the parameter count, the amount of computation and the structure of a single shuffle block for the model before fusion. In the unfused shuffle block, a single branch2 branch contains 8 ops.
After fusion the model has about 5,000 fewer parameters and the amount of computation drops by about 6,000, mainly from the bn layers. A single branch2 branch has three fewer ops, and the whole backbone network has 25 fewer bn layers.
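The arithmetic behind folding a bn layer into the preceding convolution can be sketched in a few lines. This is a minimal pure-Python illustration, not the repository's fuse_conv_and_bn: the function name fold_bn_into_conv and the toy values are assumptions made for the example, and a 1x1 convolution at a single pixel stands in for a real layer.

```python
import math

def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNorm statistics into conv weight and bias.

    weight: one flat list of kernel values per output channel.
    bias:   one value per output channel, or None for a bias-free conv.
    (Hypothetical helper for illustration only.)
    """
    fused_w, fused_b = [], []
    for c, w_c in enumerate(weight):
        scale = gamma[c] / math.sqrt(var[c] + eps)
        b_c = bias[c] if bias is not None else 0.0
        fused_w.append([w * scale for w in w_c])
        fused_b.append((b_c - mean[c]) * scale + beta[c])
    return fused_w, fused_b

def conv1x1(x, weight, bias):
    # x holds the input-channel values at one pixel
    return [sum(w * v for w, v in zip(w_c, x)) + b for w_c, b in zip(weight, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    # inference-mode BatchNorm using running statistics
    return [(v - m) / math.sqrt(s + eps) * g + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

# Toy layer: 2 input channels, 2 output channels
weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.3, -0.1]
mean, var = [0.2, -0.4], [0.9, 1.6]
x = [1.0, 2.0]

reference = batchnorm(conv1x1(x, weight, bias), gamma, beta, mean, var)
fw, fb = fold_bn_into_conv(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, fw, fb)
max_err = max(abs(a - b) for a, b in zip(reference, fused))
print(max_err)
```

The fused layer does one multiply-add pass instead of a convolution followed by a separate normalization pass, which is where the saving on memory-bound, low-compute devices comes from.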
1.2 Re-parameterization
Re-parameterization is even more important than op fusion. Take the g model introduced earlier in "Pursuit of perfection: experiments and thoughts on RepVGG re-parameterization for the industrial deployment of YOLO" (https://zhuanlan.zhihu.com/p/410874403). Because the g model is designed for high-performance GPUs, its backbone uses repvgg: during training the rbr_1x1 and identity branches are used to gain accuracy, but for inference they must be re-parameterized into a single 3×3 convolution to be cost-effective. Most directly, the following code re-parameterizes and fuses each repvgg block:
if type(m) is RepVGGBlock:
    if hasattr(m, 'rbr_1x1'):
        # print(m)
        # Merge the 3x3, 1x1 and identity branches into one 3x3 kernel and bias
        kernel, bias = m.get_equivalent_kernel_bias()
        rbr_reparam = nn.Conv2d(in_channels=m.rbr_dense.conv.in_channels,
                                out_channels=m.rbr_dense.conv.out_channels,
                                kernel_size=m.rbr_dense.conv.kernel_size,
                                stride=m.rbr_dense.conv.stride,
                                padding=m.rbr_dense.conv.padding, dilation=m.rbr_dense.conv.dilation,
                                groups=m.rbr_dense.conv.groups, bias=True)
        rbr_reparam.weight.data = kernel
        rbr_reparam.bias.data = bias
        for para in self.parameters():
            para.detach_()
        m.rbr_dense = rbr_reparam
        # m.__delattr__('rbr_dense')
        m.__delattr__('rbr_1x1')
        # The original listing checked hasattr(self, ...); the attributes live on the block m
        if hasattr(m, 'rbr_identity'):
            m.__delattr__('rbr_identity')
        if hasattr(m, 'id_tensor'):
            m.__delattr__('id_tensor')
        m.deploy = True
        m.forward = m.fusevggforward  # update forward
    # continue
    # print(m)
if type(m) is Conv and hasattr(m, 'bn'):
    # print(m)
    m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
    delattr(m, 'bn')  # remove batchnorm
    m.forward = m.fuseforward  # update forward
# Re-parameterization must be done before the fuse operation, otherwise re-parameterization will fail
The results below show obvious changes in the number of model layers, the amount of computation and the parameter count. The following figure shows the model parameters, amount of computation and model structure before and after re-parameterization:
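Why three parallel branches collapse into a single 3×3 convolution can be checked numerically. The sketch below is a simplified single-channel illustration, not the repository's get_equivalent_kernel_bias: it assumes bn has already been folded into each branch and that the identity branch has scale 1, and the names conv3x3 and merge_branches are hypothetical.

```python
def conv3x3(x, k, bias=0.0):
    """Naive single-channel 3x3 convolution, stride 1, zero padding 1."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += k[di][dj] * x[ii][jj]
            out[i][j] = acc
    return out

def merge_branches(k3, b3, k1, b1):
    """Merge a 3x3 branch, a 1x1 branch and an identity branch into one 3x3 kernel.

    A 1x1 kernel is a 3x3 kernel that is zero everywhere except the centre,
    and identity is a 1x1 kernel with weight 1, so both add to the centre tap.
    """
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0
    return merged, b3 + b1

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
b3, k1, b1 = 0.05, 0.7, -0.02

# Training-time form: sum of the three branch outputs
y3 = conv3x3(x, k3, b3)
three_branch = [[y3[i][j] + (k1 * x[i][j] + b1) + x[i][j] for j in range(3)]
                for i in range(3)]

# Inference-time form: one convolution with the merged kernel
km, bm = merge_branches(k3, b3, k1, b1)
single = conv3x3(x, km, bm)

max_err = max(abs(three_branch[i][j] - single[i][j])
              for i in range(3) for j in range(3))
print(max_err)
```

Because convolution is linear, the same centre-tap argument extends to multi-channel kernels, where the identity branch becomes a kernel with 1 at the centre of each channel's own filter.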
2. Post-processing
2.1 Inverse function operation
Optimizing post-processing is also important. The goal is to reduce inefficient loops and conditional statements, and to avoid heavy use of expensive operators.
We test and modify code based on the ncnn yolov5 demo. Because the source code links too many libraries, we pull out the generate_proposals function on its own and write an imitation of it that computes confidence with sigmoid, compares the 80 classes and then computes the bbox coordinates.
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
using namespace std;

// N is not defined in the original listing; 99 is an assumption that matches
// the two-decimal values in the printed output.
#define N 99

float sigmoid(float x)
{
    return static_cast<float>(1.f / (1.f + exp(-x)));
}

// Generate num random class scores in [0, 1)
vector<float> ram_cls_num(int num)
{
    std::vector<float> res;
    float a = 10.0, b = 100.0;
    srand(time(NULL)); // set the random seed so that each run produces a different sequence
    cout << "number class:" << endl;
    for (int i = 1; i <= num; i++)
    {
        float number = rand() % (N + 1) / (float)(N + 1);
        res.push_back(number);
        cout << number << ' ';
    }
    cout << endl;
    return res;
}

int sig()
{
    int num_anchors = 3;
    int num_grid_y = 224;
    int num_grid_x = 224;
    float prob_threshold = 0.6;
    std::vector<float> num_class = ram_cls_num(80);
    clock_t start = clock();
    for (int q = 0; q < num_anchors; q++)
    {
        for (int i = 0; i < num_grid_y; i++)
        {
            for (int j = 0; j < num_grid_x; j++)
            {
                float tmp = i * num_grid_x + j;
                float box_score = rand() % (N + 1) / (float)(N + 1);
                // find class index with max class score
                int class_index = 0;
                float class_score = 0;
                for (size_t k = 0; k < num_class.size(); k++)
                {
                    float score = num_class[k];
                    if (score > class_score)
                    {
                        class_index = k;
                        class_score = score;
                    }
                }
                // two sigmoid calls for every cell, whether or not it passes the threshold
                float confidence = sigmoid(box_score) * sigmoid(class_score);
                if (confidence >= prob_threshold)
                {
                    float dx = sigmoid(1);
                    float dy = sigmoid(2);
                    float dw = sigmoid(3);
                    float dh = sigmoid(4);
                }
            }
        }
    }
    // clock() returns ticks, so convert to milliseconds explicitly
    double ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;
    cout << "sigmoid function cost time:" << ms << "ms" << endl;
    return 0;
}
The timing is as follows:
number class:
0.65 0.08 0.62 0.33 0.79 0.7 0.44 0 0.96 0.75 0.92 0.66 0.54 0.23 0.14 0.75 0.94 0.88 0.76 0.81 0.28 0.37 0.34 0.19 0.46 0.93 0.79 0.86 0.64 0.55 0.84 0.91 0.33 0.53 0.71 0.53 0.69 0.63 0.67 0.35 0.24 0.97 0.94 0.91 0.66 0.63 0.14 0.4 0.28 0.24 0.29 0.2 0.58 0.65 0.51 0.79 0.49 0.47 0.94 0.84 0.38 0.84 0.88 0.61 0.99 0.17 0.02 0.02 0.42 0.96 0.48 0.6 0.08 0.33 0.84 0.04 0.8 0.22 0.16 0.57
sigmoid function cost time:68ms
Now modify the function. First use unsigmoid, the inverse function of sigmoid, to convert prob_threshold into a raw-score threshold. Then there is no need to traverse the 80 classes up front to find the highest-scoring class, and we no longer have to perform two sigmoid operations (to compute confidence) for every cell that reaches the third for loop. Only when box_score > unsigmoid(prob_threshold) do we run the 80-class max score lookup and then compute the bbox coordinates, confidence and so on.
// Inverse of sigmoid (the logit), valid for 0 < x < 1
float unsigmoid(float x)
{
    return static_cast<float>(-1.0f * (float)log((1.0f / x) - 1.0f));
}

int unsig()
{
    int num_anchors = 3;
    int num_grid_y = 224;
    int num_grid_x = 224;
    float prob_threshold = 0.6;
    std::vector<float> num_class = ram_cls_num(80);
    // computed once, outside the loops
    float un_prob = unsigmoid(prob_threshold);
    clock_t start = clock();
    for (int q = 0; q < num_anchors; q++)
    {
        for (int i = 0; i < num_grid_y; i++)
        {
            for (int j = 0; j < num_grid_x; j++)
            {
                float tmp = i * num_grid_x + j;
                float box_score = rand() % (N + 1) / (float)(N + 1);
                // Use the inverse of sigmoid to skip the two sigmoid calls, and move the
                // 80-class comparison after the check; cells that fail the check are skipped.
                if (box_score > un_prob)
                {
                    // find class index with max class score
                    int class_index = 0;
                    float class_score = 0;
                    for (size_t k = 0; k < num_class.size(); k++)
                    {
                        float score = num_class[k];
                        if (score > class_score)
                        {
                            class_index = k;
                            class_score = score;
                        }
                    }
                    float confidence = sigmoid(box_score) * sigmoid(class_score);
                    if (confidence >= prob_threshold)
                    {
                        float dx = sigmoid(1);
                        float dy = sigmoid(2);
                        float dw = sigmoid(3);
                        float dh = sigmoid(4);
                    }
                }
            }
        }
    }
    // clock() returns ticks, so convert to milliseconds explicitly
    double ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;
    cout << "unsigmoid function cost time:" << ms << "ms" << endl;
    return 0;
}

// Entry point added so the two listings can be compiled and compared together
int main()
{
    sig();
    unsig();
    return 0;
}
The result is as follows:
number class:
0.65 0.08 0.62 0.33 0.79 0.7 0.44 0 0.96 0.75 0.92 0.66 0.54 0.23 0.14 0.75 0.94 0.88 0.76 0.81 0.28 0.37 0.34 0.19 0.46 0.93 0.79 0.86 0.64 0.55 0.84 0.91 0.33 0.53 0.71 0.53 0.69 0.63 0.67 0.35 0.24 0.97 0.94 0.91 0.66 0.63 0.14 0.4 0.28 0.24 0.29 0.2 0.58 0.65 0.51 0.79 0.49 0.47 0.94 0.84 0.38 0.84 0.88 0.61 0.99 0.17 0.02 0.02 0.42 0.96 0.48 0.6 0.08 0.33 0.84 0.04 0.8 0.22 0.16 0.57
unsigmoid function cost time:77ms
That is slower rather than faster, so something is off. Let's raise prob_threshold to 0.6 (the value shown in the listings above) and measure again:
sigmoid function cost time:69ms
unsigmoid function cost time:47ms
Now the benefit is visible. The higher the threshold, the less time the unsigmoid version takes, but the cost is that targets get filtered out by an overly high threshold and never reach the second half of the function. So the inverse function lets us bypass the exponential operations of the two sigmoid calls (used to compute confidence), but whether to use this method still depends on the actual business scenario. If the targets' box_score values are generally low, so that a low threshold is needed and most candidates pass the check anyway, this optimization becomes a negative optimization.
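The early-exit logic rests on a simple fact: sigmoid(class_score) is always below 1, so the product sigmoid(box_score) * sigmoid(class_score) can only reach the threshold if sigmoid(box_score) alone does, which is the same as box_score >= unsigmoid(threshold). The Python sketch below is an illustration under that reasoning, with hypothetical function names and toy scores; it checks that the early-exit version keeps exactly the same cells as the naive one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unsigmoid(p):
    # inverse of sigmoid (the logit), valid for 0 < p < 1
    return -math.log(1.0 / p - 1.0)

def argmax(values):
    # index of the first maximum
    return max(range(len(values)), key=values.__getitem__)

def decode_naive(cells, thr):
    """Always search the classes and call sigmoid twice per cell."""
    kept = []
    for idx, (box_score, class_scores) in enumerate(cells):
        best = argmax(class_scores)
        conf = sigmoid(box_score) * sigmoid(class_scores[best])
        if conf >= thr:
            kept.append((idx, best))
    return kept

def decode_early_exit(cells, thr):
    """Compare the raw box score with the inverse-sigmoid threshold first."""
    raw_thr = unsigmoid(thr)  # computed once, outside the loop
    kept = []
    for idx, (box_score, class_scores) in enumerate(cells):
        if box_score < raw_thr:
            continue  # sigmoid(box_score) < thr, so the product cannot reach thr
        best = argmax(class_scores)
        conf = sigmoid(box_score) * sigmoid(class_scores[best])
        if conf >= thr:
            kept.append((idx, best))
    return kept

# Toy raw scores: (box_score, [class scores])
cells = [
    (2.0, [0.1, 3.0, -1.0]),
    (-1.0, [5.0, 0.0, 0.0]),
    (0.5, [-2.0, -3.0, -1.0]),
    (3.0, [4.0, 1.0, 2.0]),
]
naive = decode_naive(cells, 0.6)
fast = decode_early_exit(cells, 0.6)
print(naive, fast)
```

The saving comes only from cells that fail the first comparison, which matches the observation above that the gain grows with the threshold and disappears when most cells pass.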
2.2 OpenMP multi-threaded parallelism
If post-processing contains many for loops, and the loop iterations have no data or function dependencies between them, consider using the OpenMP library for multi-threaded parallel acceleration. For example, searching the 80 classes for the class with the highest score:
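// class_index/class_score are shared, so each thread keeps a local best and merges it under a critical section
#pragma omp parallel num_threads(ncnn::get_big_cpu_count())
{
    int local_index = 0;
    float local_score = -FLT_MAX; // needs <float.h>
    #pragma omp for nowait
    for (int k = 0; k < num_class; k++) {
        float score = featptr[5 + k];
        if (score > local_score) { local_index = k; local_score = score; }
    }
    #pragma omp critical
    if (local_score > class_score) { class_index = local_index; class_score = local_score; }
}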
Or compute the location information of each target with multiple threads:
#pragma omp parallel for num_threads(ncnn::get_big_cpu_count())
for (int i = 0; i < count; i++) {
    objects[i] = proposals[picked[i]];
    // adjust offset to original unpadded
    float x0 = (objects[i].rect.x) / scale;
    float y0 = (objects[i].rect.y) / scale;
    float x1 = (objects[i].rect.x + objects[i].rect.width) / scale;
    float y1 = (objects[i].rect.y + objects[i].rect.height) / scale;
    // clip
    x0 = std::max(std::min(x0, (float) (img_w - 1)), 0.f);
    y0 = std::max(std::min(y0, (float) (img_h - 1)), 0.f);
    x1 = std::max(std::min(x1, (float) (img_w - 1)), 0.f);
    y1 = std::max(std::min(y1, (float) (img_h - 1)), 0.f);
    objects[i].rect.x = x0;
    objects[i].rect.y = y0;
    objects[i].rect.width = x1 - x0;
    objects[i].rect.height = y1 - y0;
}
However, the underlying ncnn source code already implements parallel computing, so this brought no speed-up here. It is still worth recording as a method for later use.
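One caveat when parallelizing a max search: the running best value is shared state, so the search is a reduction rather than a set of fully independent iterations. A common pattern is for each worker to find a local best over its own slice and to merge the partial results afterwards. The Python sketch below illustrates only that pattern; the names local_argmax and parallel_argmax are hypothetical, and Python threads are used for clarity, not for speed.

```python
from concurrent.futures import ThreadPoolExecutor

def local_argmax(scores, start, stop):
    """Best (score, index) within scores[start:stop]; the first maximum wins."""
    best_k, best = start, scores[start]
    for k in range(start + 1, stop):
        if scores[k] > best:
            best_k, best = k, scores[k]
    return best, best_k

def parallel_argmax(scores, num_workers=4):
    """Chunked max search over a non-empty list: per-worker local best, then a merge."""
    n = len(scores)
    chunk = (n + num_workers - 1) // num_workers
    ranges = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial = list(pool.map(lambda r: local_argmax(scores, *r), ranges))
    # Merge step: highest score wins, ties go to the lowest index,
    # which matches what a sequential left-to-right scan would return.
    best, best_k = max(partial, key=lambda p: (p[0], -p[1]))
    return best_k, best

scores = [0.1, 0.9, 0.3, 0.9, 0.2]
result = parallel_argmax(scores, num_workers=3)
print(result)
```

For only 80 classes the merge and thread start-up overhead can easily outweigh the work being split, which is consistent with the lack of speed-up reported above.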
After the modifications above, the model's detection results are as follows:
Xiaomi 10, CPU (Snapdragon 865):
Redmi K30, CPU (Snapdragon 730G):
Code link :https://github.com/ppogg/ncnn-android-v5lite
Welcome star and fork~