Project deployment (I): selecting operators for mobile devices
2022-07-08 02:20:00 【pogg_】
Preface: This article was first published on GiantPandaCV. Please do not reprint without permission.
This post grew out of earlier discussions with friends: for deployment on edge devices and on boards with limited compute, ReLU and LeakyReLU are generally the preferred activation functions. So what does the overhead of functions such as Sigmoid and Mish, which we often talk about, actually amount to? This post analyzes the question at the experimental level; it is an extension of my earlier article on activation functions in the learning-CV-from-scratch series: https://zhuanlan.zhihu.com/p/380237014.
What follows is only my personal opinion; if anything is wrong, corrections and criticism are welcome.
1. Activation functions
There is already plenty of material online that explains activation functions in detail. In everyday work, the ones we see most often are ReLU, LeakyReLU, Sigmoid, Swish, HardSwish, SELU, Mish and so on. To give a few familiar examples from the YOLO series: YOLOv3 uses LeakyReLU as its activation function; YOLOv4 improves model performance with the Mish function, but at a high cost; and the YOLOv5 author uses the SiLU function as a balance between speed and accuracy.
It is fair to say that different activation functions bring different gains, but a higher cost does not guarantee a better result; the most computationally expensive activation function is not necessarily the one that works best. Here is an ablation chart comparing different activation functions on YOLOv5s:
The chart above comes from a YOLOv5 issue in which the author compares the performance of different activation functions. As you can see, although the Mish function is expensive, its gain is not optimal for a small model such as YOLOv5s; by comparison, the reproduced Swish results even exceed those of Mish, and at the same time Swish is 15-20% faster than Mish:
Tested on an NVIDIA A100, from OneFlow's zzk
The chart above compares the latency and bandwidth of different activation functions on an A100 GPU.
2. Comparing the underlying activation operators
What this post wants to compare, however, is deployment: different operators perform differently, so we call the underlying operators of the ncnn forward-inference framework and test them directly.
Before the comparison, to make sure the values in the input Mat are random, we first push the input through a small model consisting of only five convolution layers and three pooling layers (which element-wise multiplies the parameters), and then feed its output into the activation-function operators:
static int init_net3x3(ncnn::Net* net, int* target_size)
{
    // test with multiple threads
    net->opt.num_threads = 4;

    int ret = 0;
    const char* net_param = "5xConv3x3x128.param";
    const char* net_model = "5xConv3x3x128.bin";
    *target_size = 224;

    ret = net->load_param(net_param);
    if (ret != 0)
    {
        return ret;
    }
    ret = net->load_model(net_model);
    if (ret != 0)
    {
        return ret;
    }
    return 0;
}
static ncnn::Mat forward_net3x3(const cv::Mat& bgr, int target_size, ncnn::Net* net)
{
    // resize the BGR image to target_size x target_size and convert it to RGB
    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, bgr.cols, bgr.rows, target_size, target_size);

    ncnn::Extractor ex = net->create_extractor();
    ex.input("input.1", in);

    ncnn::Mat out;
    ex.extract("18", out);
    return out;
}
Next, we feed the computed random values into the underlying activation-function operators:
static int ReLU(const ncnn::Mat& bottom_top_blob, const ncnn::Option& opt)
{
    int w = bottom_top_blob.w;
    int h = bottom_top_blob.h;
    int channels = bottom_top_blob.c;
    int size = w * h;

    #pragma omp parallel for num_threads(opt.num_threads)
    for (int q = 0; q < channels; q++)
    {
        // channel(q) already selects channel q, so index its elements directly
        ncnn::Mat ptr = bottom_top_blob.channel(q);
        for (int i = 0; i < size; i++)
        {
            if (ptr[i] < 0)
            {
                ptr[i] = 0;
            }
        }
    }
    return 0;
}
static int Swish(const ncnn::Mat& bottom_top_blob, const ncnn::Option& opt)
{
    int w = bottom_top_blob.w;
    int h = bottom_top_blob.h;
    int channels = bottom_top_blob.c;
    int size = w * h;

    #pragma omp parallel for num_threads(opt.num_threads)
    for (int q = 0; q < channels; q++)
    {
        ncnn::Mat ptr = bottom_top_blob.channel(q);
        for (int i = 0; i < size; i++)
        {
            // Swish / SiLU: x * sigmoid(x), one expf per element
            float x = ptr[i];
            ptr[i] = x / (1.f + expf(-x));
        }
    }
    return 0;
}
/* ...The content is too long, omit the remaining 100 lines of code... */
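The remaining operator implementations are omitted in the source. For reference, the Tanh operator that the timing code below calls might look roughly like the sketch that follows; this is my own illustration in the same style as the operators above, not the author's omitted code.

// Assumes the same includes as the operators above (ncnn headers, <math.h> for tanhf).
static int Tanh(const ncnn::Mat& bottom_top_blob, const ncnn::Option& opt)
{
    int w = bottom_top_blob.w;
    int h = bottom_top_blob.h;
    int channels = bottom_top_blob.c;
    int size = w * h;

    #pragma omp parallel for num_threads(opt.num_threads)
    for (int q = 0; q < channels; q++)
    {
        ncnn::Mat ptr = bottom_top_blob.channel(q);
        for (int i = 0; i < size; i++)
        {
            // tanh evaluates an exponential internally, so it costs noticeably
            // more per element than ReLU or LeakyReLU
            ptr[i] = tanhf(ptr[i]);
        }
    }
    return 0;
}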
We run each activation-function operator 100,000 times on an Intel CPU, with an input of size 3×3×128, compute the latency of a single inference, and plot a histogram:
int main(int argc, char** argv)
{
    int target_size = 224;

    // load the small 5-conv / 3-pool network used to randomize the input
    ncnn::Net net3x3;
    int ret = init_net3x3(&net3x3, &target_size);

    // run it once to get a feature map that serves as input to the activation operators
    cv::Mat m = cv::imread("C:/Users/chen/Desktop/3dd980d7f22fd0607c80f5ebc2c1c2e.jpg", 1);
    ncnn::Mat out = forward_net3x3(m, target_size, &net3x3);

    ncnn::Option opt;
    opt.num_threads = 1;
    int forward_times = 100000;

    // time 100,000 in-place Tanh passes over the same feature map
    double tanh_start = GetTickCount();
    for (int i = 0; i < forward_times; i++)
        Tanh(out, opt);
    double tanh_end = GetTickCount();
    fprintf(stderr, "Forward %d times. Tanh cost time: %.5f ms \n", forward_times, (tanh_end - tanh_start));

    /* ...The content is too long, omit the remaining 100 lines of code... */
    return 0;
}
As the histogram shows, ReLU and LeakyReLU take the least time, while Mish takes the longest, far more than the other activation functions. Taking ReLU and LeakyReLU as the baseline, we can see that both functions have constant per-element complexity, that is, only a single addition, subtraction, multiplication or division per element:
ReLU functional expression: ReLU(x) = max(0, x)
LeakyReLU functional expression: LeakyReLU(x) = x for x ≥ 0, and αx for x < 0, where α is a small fixed slope (commonly 0.01 or 0.1)
The curves of the two operators:
ReLU on the left, LeakyReLU on the right.
Mish functional expression: Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x)), which needs an exponential, a logarithm and a tanh for every element.
The function curve:
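To make the cost difference concrete at the operator level, element-wise LeakyReLU and Mish kernels written in the same style as the operators above might look like the sketch below (my own illustration, not the author's omitted code; the slope value is an assumption). LeakyReLU needs only a compare and at most one multiply per element, while Mish needs an exp, a log and a tanh.

// Assumes the same includes as the operators above (ncnn headers, <math.h> for expf/logf/tanhf).
static int LeakyReLU(const ncnn::Mat& bottom_top_blob, const ncnn::Option& opt, float slope)
{
    int w = bottom_top_blob.w;
    int h = bottom_top_blob.h;
    int channels = bottom_top_blob.c;
    int size = w * h;

    #pragma omp parallel for num_threads(opt.num_threads)
    for (int q = 0; q < channels; q++)
    {
        ncnn::Mat ptr = bottom_top_blob.channel(q);
        for (int i = 0; i < size; i++)
        {
            // constant cost: one compare and at most one multiply per element
            if (ptr[i] < 0)
                ptr[i] = ptr[i] * slope;   // e.g. slope = 0.1f (assumed value)
        }
    }
    return 0;
}

static int Mish(const ncnn::Mat& bottom_top_blob, const ncnn::Option& opt)
{
    int w = bottom_top_blob.w;
    int h = bottom_top_blob.h;
    int channels = bottom_top_blob.c;
    int size = w * h;

    #pragma omp parallel for num_threads(opt.num_threads)
    for (int q = 0; q < channels; q++)
    {
        ncnn::Mat ptr = bottom_top_blob.channel(q);
        for (int i = 0; i < size; i++)
        {
            // x * tanh(ln(1 + e^x)): an exp, a log and a tanh per element,
            // which is why Mish is by far the slowest operator in the comparison
            float x = ptr[i];
            ptr[i] = x * tanhf(logf(1.f + expf(x)));
        }
    }
    return 0;
}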
Beyond this, we enlarge the input tensor from 3×3×128 to 3×3×1024 and again run 100,000 forward passes, obtaining the latency of each inference:
As you can see, as the input grows, the latency gap between the exponential activation functions and the constant-time ones becomes larger and larger. With a 3×3×128 input:
And with the input enlarged to 3×3×1024:
When the number of input elements increases, each forward pass performs more floating-point operations and the function occupies more memory; a direct consequence is that the board's memory overclocking becomes unstable. Friends who have played with development boards probably know that memory frequency directly affects the bandwidth of the computing platform, and the efficiency of a function or model is limited by the bandwidth resources of the board or platform. This may also be why the exponential operators become slightly less efficient once the input size increases substantially.
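To make the bandwidth argument a bit more concrete, here is a rough back-of-the-envelope sketch of my own: the tensor size is taken from the experiment above, but the per-element operation counts are assumptions, not measurements. An element-wise activation reads and writes every element exactly once, so the cheaper the per-element math, the lower the arithmetic intensity and the more likely the operator is to be limited by memory bandwidth rather than compute.

#include <cstdio>

int main()
{
    // Assumed tensor size from the experiment above: 3 x 3 x 1024 float32 values.
    const long long elems = 3LL * 3 * 1024;
    const long long bytes_moved = elems * 4 * 2;   // read + write each element once

    // Rough per-element FLOP estimates (assumptions, not measurements):
    // ReLU / LeakyReLU need ~1-2 ops; Mish needs an exp, a log and a tanh,
    // often costing on the order of tens of ops once expanded by the math library.
    const double flops_relu = 2.0 * elems;
    const double flops_mish = 40.0 * elems;

    printf("bytes moved per pass: %lld\n", bytes_moved);
    printf("arithmetic intensity ReLU: %.3f FLOP/byte\n", flops_relu / bytes_moved);
    printf("arithmetic intensity Mish: %.3f FLOP/byte\n", flops_mish / bytes_moved);
    // A low FLOP/byte ratio means the operator is bandwidth-bound: for ReLU the memory
    // traffic dominates, while Mish shifts more of the cost onto the arithmetic units.
    return 0;
}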
The following figure shows the memory occupied by the different activation functions:
Activation functions such as ReLU and LeakyReLU are by far the most common choice in lightweight networks and mobile deployments (I have yet to see a lightweight network that uses Mish). On the one hand, as constant-order operators they have low latency and compute quickly; on the other hand, they involve no expensive instructions such as exponentials, and they cope well even when the parameter count balloons, which makes them very suitable for boards where compute resources are extremely scarce.