CUDA implementation of focal_loss
2022-07-06 08:56:00 【cyz0202】
Reference: mmdetection source code reading: the CUDA extension of focal loss (Zhihu)
Readers need a general understanding of CUDA programming and of how loss functions work; this article does not introduce either in detail.
Tricks for writing fast CUDA code
The figures are taken from the reference above (they will be removed on request); the red text is my annotation.
Where that is still vague, I explain it in more detail below.
Worked example: the CUDA implementation of focal_loss
Note: the code in this section comes from the reference article above.
step1: the Python-side call (source code from the mmdetection toolkit)
- Note the call to _sigmoid_focal_loss (i.e. sigmoid_focal_loss). This function is essentially a wrapper around the CUDA implementation of focal_loss; it is introduced below.
```python
import torch
import torch.nn as nn
from mmcv.ops import sigmoid_focal_loss as _sigmoid_focal_loss

# sigmoid_focal_loss is in fact SigmoidFocalLossFunction.apply (see step2)
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma, self.alpha = gamma, alpha

    def forward(self,
                pred,    # tensor(num_total_anchors, num_classes)
                target,  # tensor(num_total_anchors, )
                ):
        if torch.cuda.is_available() and pred.is_cuda:
            loss = _sigmoid_focal_loss(pred.contiguous(), target.contiguous(),
                                       self.gamma, self.alpha, None, 'none')
            # contiguous() guarantees contiguous storage, so the CUDA kernel
            # can access consecutive memory without errors
            return loss
```
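A quick smoke test of the wrapper above — a hedged sketch assuming a CUDA device and a CUDA build of mmcv (the sizes are purely illustrative):

```python
import torch

num_total_anchors, num_classes = 8, 4
pred = torch.randn(num_total_anchors, num_classes, device='cuda')
# labels 0 .. num_classes-1 mark the positive class; the value num_classes
# marks negative/ignored samples (see the target convention in step4)
target = torch.randint(0, num_classes + 1, (num_total_anchors,), device='cuda')

loss = FocalLoss(gamma=2.0, alpha=0.25)(pred, target)
print(loss.shape)  # torch.Size([8, 4]): one loss per anchor-class pair
```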
step2: using autograd.Function to specify the forward and backward computations (why: we defined the operator in CUDA, so after wrapping it we need autograd to tell torch how to run the forward and backward passes)
- Note that it inherits from Function, uses ext_module (the Python-CUDA binding from the next step), and is exposed via Function.apply().
```python
from torch.autograd.function import Function, once_differentiable

# ext_module is the compiled Python-CUDA binding produced in step3

class SigmoidFocalLossFunction(Function):

    @staticmethod
    def forward(ctx,
                input,
                target,
                gamma=2.0,
                alpha=0.25,
                weight=None,
                reduction='mean'):
        # store reduction_dict, gamma, etc. on ctx for the backward pass
        ctx.reduction_dict = {'none': 0, 'mean': 1, 'sum': 2}
        assert reduction in ctx.reduction_dict.keys()
        ctx.gamma = float(gamma)
        ctx.alpha = float(alpha)
        ctx.reduction = ctx.reduction_dict[reduction]

        output = input.new_zeros(input.size())  # allocate the output buffer
        # call the real CUDA extension; ext_module binds the CUDA code
        ext_module.sigmoid_focal_loss_forward(
            input, target, weight, output, gamma=ctx.gamma, alpha=ctx.alpha)

        if ctx.reduction == ctx.reduction_dict['mean']:
            output = output.sum() / input.size(0)
        elif ctx.reduction == ctx.reduction_dict['sum']:
            output = output.sum()

        ctx.save_for_backward(input, target, weight)  # save tensors for backward
        return output

    @staticmethod
    @once_differentiable
    def backward(ctx, grad_output):
        input, target, weight = ctx.saved_tensors
        grad_input = input.new_zeros(input.size())
        # call the real CUDA extension
        ext_module.sigmoid_focal_loss_backward(
            input, target, weight, grad_input, gamma=ctx.gamma, alpha=ctx.alpha)

        grad_input *= grad_output
        if ctx.reduction == ctx.reduction_dict['mean']:
            grad_input /= input.size(0)
        return grad_input, None, None, None, None, None


# define sigmoid_focal_loss as SigmoidFocalLossFunction.apply;
# apply invokes forward (and hooks backward into autograd)
sigmoid_focal_loss = SigmoidFocalLossFunction.apply
```
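Because sigmoid_focal_loss is just Function.apply, gradients flow through the custom backward. A hedged end-to-end sketch (positional argument order as in forward above; assumes a CUDA build of mmcv):

```python
import torch
from mmcv.ops import sigmoid_focal_loss  # = SigmoidFocalLossFunction.apply

pred = torch.randn(16, 80, device='cuda', requires_grad=True)
target = torch.randint(0, 80, (16,), device='cuda')

# positional args follow forward(): input, target, gamma, alpha, weight, reduction
loss = sigmoid_focal_loss(pred, target, 2.0, 0.25, None, 'mean')
loss.backward()         # dispatches to SigmoidFocalLossFunction.backward
print(pred.grad.shape)  # torch.Size([16, 80])
```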
step3: use pybind to bind Python to the CUDA (C++) side, so that a Python module can call the CUDA functions.
- This binding is defined in a C++ file.
- TORCH_EXTENSION_NAME corresponds to a name you choose; for the code above it is ext_module (it could be any name you like; it is usually specified in setup.py — see the sketch after the binding code below).
- m is the module; the first argument of def is the name, and the second is the C++/CUDA function to bind. Here ext_module.sigmoid_focal_loss_forward is bound to the sigmoid_focal_loss_forward function defined in the current C++ file; that function ultimately reaches the CUDA-accelerated code, introduced below.
- The remaining arguments of def declare the bound function's parameters; they do not affect the understanding here, so we will not go into them.
- The sigmoid_focal_loss_forward function will in turn call the SigmoidFocalLossForwardCUDAKernelLauncher function; see below.
```cpp
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("sigmoid_focal_loss_forward", &sigmoid_focal_loss_forward,
        "sigmoid_focal_loss_forward", py::arg("input"), py::arg("target"),
        py::arg("weight"), py::arg("output"), py::arg("gamma"),
        py::arg("alpha"));
}
```
The binding for sigmoid_focal_loss_backward is analogous and omitted.
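For completeness, a hedged sketch of how the extension name is typically specified in setup.py (illustrative only — mmcv's real build is more involved; the name ext_module and the source file paths here are assumptions):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='ext_module',
    ext_modules=[
        # the extension name becomes TORCH_EXTENSION_NAME, i.e. the module
        # imported on the Python side; the sources are hypothetical paths
        CUDAExtension(
            name='ext_module',
            sources=['sigmoid_focal_loss.cpp',
                     'sigmoid_focal_loss_cuda.cu'],
        ),
    ],
    cmdclass={'build_ext': BuildExtension},
)
```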
step4: SigmoidFocalLossForwardCUDAKernelLauncher sets up the CUDA thread organization and other basics according to the various conditions (think of it as preparing the environment), then calls the real CUDA kernel to do the core computation (here, the focal_loss computation — the part that does the actual work)
- The name Launcher is evocative, but elsewhere you may not see the Launcher naming, since it is only a convention; there will, however, always be some function playing this role.
- Below is the C++ code:
- It fetches the relevant parameters and performs some basic checks, and it sets up the thread organization, i.e. the a and b inside <<<a, b>>>.
- Once the environment is ready, the kernel can be called.
- AT_DISPATCH_FLOATING_TYPES_AND_HALF is a parameterized macro: it forwards its first argument (the data type) to scalar_t below and then executes the [&]() lambda (which launches the kernel).
- Note also that Tensor provides the template member function .data_ptr<scalar_t>(), which returns the base address of the tensor's contiguous storage, cast to a scalar_t* pointer. Remember that what a tensor really stores in memory is a one-dimensional contiguous array: tensor(B, C, H, W) is actually stored as a contiguous array of length B*C*H*W, and data_ptr() returns the base address of that array.

```cpp
void SigmoidFocalLossForwardCUDAKernelLauncher(Tensor input, Tensor target,
                                               Tensor weight, Tensor output,
                                               const float gamma,
                                               const float alpha) {
  // input  is tensor(num_total_anchors, num_classes)
  // output is tensor(num_total_anchors, num_classes)
  // target is tensor(num_total_anchors, ); 0 ~ num_class-1 is the class of a
  // positive sample, while the value num_class marks negative/ignored samples
  int output_size = output.numel();  // equals num_total_anchors * num_classes
  int num_classes = input.size(1);
  AT_ASSERTM(target.max().item<long>() <= (long)num_classes,
             "target label should smaller or equal than num classes");
  at::cuda::CUDAGuard device_guard(input.device());
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  AT_DISPATCH_FLOATING_TYPES_AND_HALF(
      input.scalar_type(), "sigmoid_focal_loss_forward_cuda_kernel", [&] {
        // sigmoid_focal_loss_forward_cuda_kernel is the CUDA kernel; it is a
        // template function, and <scalar_t> fixes its data type
        sigmoid_focal_loss_forward_cuda_kernel<scalar_t>
            <<<GET_BLOCKS(output_size), THREADS_PER_BLOCK, 0, stream>>>(
                output_size, input.data_ptr<scalar_t>(),
                target.data_ptr<int64_t>(), weight.data_ptr<scalar_t>(),
                output.data_ptr<scalar_t>(), gamma, alpha, num_classes);
      });
  AT_CUDA_CHECK(cudaGetLastError());
}

// backward is analogous and omitted

// Below is the thread-organization setup CUDA needs; if this is unfamiliar,
// see a CUDA programming reference
#define THREADS_PER_BLOCK 512  // common choices are 128, 512, or 1024

inline int GET_BLOCKS(const int N) {
  // choose the grid x-dimension so that blockDim.x * gridDim.x covers N
  int optimal_block_num = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
  int max_block_num = 65000;
  return min(optimal_block_num, max_block_num);
}
```
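To make the launch arithmetic concrete, here is a small Python transcription of the GET_BLOCKS helper with worked numbers (my own sketch; the constants mirror the C++ above, and the sizes are invented):

```python
THREADS_PER_BLOCK = 512  # same illustrative value as above

def get_blocks(n: int, max_block_num: int = 65000) -> int:
    """Python transcription of the GET_BLOCKS helper above."""
    return min((n + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK, max_block_num)

# e.g. num_total_anchors = 20000, num_classes = 80 (invented sizes)
N = 20000 * 80              # 1,600,000 losses to compute
print(get_blocks(N))        # 3125 blocks: 3125 * 512 threads, one per loss
print(get_blocks(10 ** 8))  # 65000 blocks: capped, so each thread loops over
                            # roughly 10**8 / (65000 * 512) ~ 3 losses
```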
step5: the kernel implementation
- This is where the real work happens. At this point you must understand how focal_loss is computed in order to write the kernel (in fact, when choosing the a and b parameters inside <<<a, b>>> above, we already had to think about how this part would be written).
- First derive focal_loss; the derivation is sketched below (the original article shows it as a figure).
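The derivation figure is not reproduced here; the following is my reconstruction of the standard sigmoid focal loss forward and backward formulas (per anchor n and class c, with logit x, p = σ(x), and target label t), which match the term_p / term_n expressions in the step5 kernel:

```latex
\begin{aligned}
p &= \sigma(x) = \frac{1}{1 + e^{-x}} \\[4pt]
\mathrm{FL}(x) &=
  \begin{cases}
    -\alpha\,(1-p)^{\gamma}\,\log p,    & t = c \ \text{(positive)}\\
    -(1-\alpha)\,p^{\gamma}\,\log(1-p), & t \neq c \ \text{(negative)}
  \end{cases}\\[4pt]
% backward, using dp/dx = p(1-p):
\frac{\partial\,\mathrm{FL}}{\partial x} &=
  \begin{cases}
    -\alpha\,(1-p)^{\gamma}\,\bigl(1 - p - \gamma\,p\log p\bigr),    & t = c\\
    -(1-\alpha)\,p^{\gamma}\,\bigl(\gamma\,(1-p)\log(1-p) - p\bigr), & t \neq c
  \end{cases}
\end{aligned}
```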
- The next step is the implementation. The kernel body describes what each thread should do. Here we need to compute the loss, per class; concretely, the inputs are:
- pred, a tensor of shape [num_total_anchors, num_classes];
- target, also a tensor, of shape [num_total_anchors];
- optionally weight, the per-class loss weight (an unimportant class can be given a small weight, so even if its prediction loss is large, it contributes little to the averaged loss).
- Multi-class losses are usually implemented with softmax; mmdetection instead uses sigmoid + focal_loss: sigmoid computes a loss against each of the num_classes categories, combined with focal_loss (positive and negative classes get different weights), and the results are finally reduced (e.g. averaged to obtain the mean loss).
- With these ideas in place, we know how to design the multithreaded CUDA computation (i.e. design the kernel):
- Since a loss is computed for each of the num_classes categories of every anchor (N = num_total_anchors * num_classes losses in total), we want them all computed in parallel.
- So we would like num_total_anchors * num_classes threads; this is why the GET_BLOCKS argument (the a in <<<a, b>>> above) is set to output_size = num_total_anchors * num_classes. Of course CUDA cannot run that many threads simultaneously, but that is no problem: we cap the total number of CUDA threads, and the remaining work is handled by having those threads loop. This is what GET_BLOCKS does; see below.
- CUDA requires choosing the thread-block size (usually called THREADS_PER_BLOCK) and the number of thread blocks (usually BLOCKS, BLOCK_NUM, or BLOCKS_PER_GRID). Normally their product is slightly larger than the total number of tasks (e.g. the number N of losses above); but CUDA does not have unlimited threads, so both constants must be bounded, as the max_block_num cap at the end of GET_BLOCKS above does for the block count.
- We then have THREADS_PER_BLOCK * BLOCKS_PER_GRID threads, which can be thought of as executing in parallel on the GPU; each thread computes one loss, and the remaining losses are handled by the same threads looping (see the worked numbers at the end of step4).
- Based on the steps above and the focal_loss forward/backward derivation, we can write the kernel below. A few notes:
- CUDA_1D_KERNEL_LOOP(i, n) is a parameterized macro defining the loop used when, as mentioned above, there are fewer threads than tasks.
- i is the global index of the current thread, computed from CUDA's built-in variables; n is the total number of tasks computed above (the number of losses, i.e. output_size).
- So index locates which loss the current thread should compute. Recall from earlier that a multidimensional tensor actually lives in memory as a one-dimensional array, which is why the kernel parameters are raw pointers; input[index] therefore addresses the right element, from which we recover the current anchor (n in the code) and which class's loss of that anchor we are computing (c in the code).
- The computation itself follows the focal_loss forward and backward formulas derived above; it is fairly straightforward.
- Note that in the backward pass (the backward function), the number of parallel threads may differ from the forward pass, but the computation must stay consistent with forward (e.g. if a loss is computed per position, a gradient must be computed per position).
- The code is as follows:
```cuda
// blockDim.x * gridDim.x is the total number of threads launched
#define CUDA_1D_KERNEL_LOOP(i, n)                            \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \
       i += blockDim.x * gridDim.x)

template <typename T>
__global__ void sigmoid_focal_loss_forward_cuda_kernel(
    const int nthreads, const T* input, const int64_t* target, const T* weight,
    T* output, const T gamma, const T alpha, const int num_classes) {
  // nthreads is output_size, equal to num_total_anchors * num_classes
  // const T* is the base address of the tensor's contiguous memory:
  // input has num_total_anchors * num_classes contiguous elements,
  // target has num_total_anchors contiguous elements
  CUDA_1D_KERNEL_LOOP(index, nthreads) {
    // index equals blockIdx.x * blockDim.x + threadIdx.x, the thread index;
    // it corresponds to one element of tensor(num_total_anchors, num_classes)
    int n = index / num_classes;  // the anchor this element belongs to
    int c = index % num_classes;  // the class this element belongs to

    int64_t t = target[n];  // target label of anchor n
    T flag_p = (t == c);    // positive sample
    T flag_n = (t != c);    // negative sample

    // p = sigmoid(x) = 1. / (1. + expf(-x))
    T p = (T)1. / ((T)1. + expf(-input[index]));

    // (1 - p)**gamma * log(p): positive-sample focal loss term
    T term_p = pow(((T)1. - p), gamma) * log(max(p, (T)FLT_MIN));
    // p**gamma * log(1 - p): negative-sample focal loss term
    T term_n = pow(p, gamma) * log(max((T)1. - p, (T)FLT_MIN));

    output[index] = (T)0.;  // the result is written into the output tensor
    output[index] += -flag_p * alpha * term_p;
    output[index] += -flag_n * ((T)1. - alpha) * term_n;

    if (weight != NULL) {
      output[index] *= weight[t];
    }
  }
}

// Backpropagation, analogous to the forward pass
template <typename T>
__global__ void sigmoid_focal_loss_backward_cuda_kernel(
    const int nthreads, const T* input, const int64_t* target, const T* weight,
    T* grad_input, const T gamma, const T alpha, const int num_classes) {
  CUDA_1D_KERNEL_LOOP(index, nthreads) {
    int n = index / num_classes;
    int c = index % num_classes;

    int64_t t = target[n];
    T flag_p = (t == c);
    T flag_n = (t != c);

    // p = sigmoid(x) = 1. / (1. + expf(-x))
    T p = (T)1. / ((T)1. + exp(-input[index]));

    // (1 - p)**gamma * (1 - p - gamma*p*log(p))
    T term_p = pow(((T)1. - p), gamma) *
               ((T)1. - p - (gamma * p * log(max(p, (T)FLT_MIN))));
    // p**gamma * (gamma * (1 - p) * log(1 - p) - p)
    T term_n = pow(p, gamma) *
               (gamma * ((T)1. - p) * log(max((T)1. - p, (T)FLT_MIN)) - p);

    grad_input[index] = (T)0.;
    grad_input[index] += -flag_p * alpha * term_p;
    grad_input[index] += -flag_n * ((T)1. - alpha) * term_n;

    if (weight != NULL) {
      grad_input[index] *= weight[t];
    }
  }
}
```
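As a cross-check of the kernel math, here is a hedged pure-PyTorch reference of the same per-element forward computation (my own sketch, not mmcv code; sigmoid_focal_loss_ref is a hypothetical name):

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss_ref(input, target, gamma=2.0, alpha=0.25):
    """Per-element reference mirroring sigmoid_focal_loss_forward_cuda_kernel."""
    num_classes = input.size(1)
    # one-hot over num_classes + 1 columns, then drop the last column, so that
    # target == num_classes (negative/ignored sample) yields an all-zero row
    flag_p = F.one_hot(target, num_classes + 1)[:, :num_classes].to(input.dtype)
    flag_n = 1.0 - flag_p
    p = torch.sigmoid(input)
    eps = torch.finfo(input.dtype).tiny  # plays the role of FLT_MIN above
    term_p = (1 - p) ** gamma * torch.log(p.clamp_min(eps))
    term_n = p ** gamma * torch.log((1 - p).clamp_min(eps))
    return -flag_p * alpha * term_p - flag_n * (1 - alpha) * term_n
```

On the same inputs, its output should match the CUDA forward pass with reduction='none' and weight=None up to floating-point tolerance (e.g. checked with torch.allclose).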
Summary
- That concludes the walkthrough of the CUDA implementation of the focal_loss operator in mmdetection.
- Implementing an operator follows a fairly fixed pipeline; from top to bottom:
- Python code calls an autograd Function
- the autograd Function defines the forward and backward computations
- those forward and backward computations call into the Python module that binds the C++ functions
- that module maps its forward and backward functions to the C++ forward and backward functions
- the C++ forward and backward functions prepare the environment (most importantly the thread organization) and launch the CUDA kernels
- the kernels define how CUDA implements the specific operator (focal_loss above)
- The difficulty lies in the final kernel: the task must first be decomposed into work that many threads can execute in parallel; only then can the kernel be written.
- TODO: CUDA implementations of other, more complex operators