A Plain-Language Explanation of PyTorch's nn.CrossEntropyLoss
2022-07-05 09:06:00 【aelum】
About the author: I moved into programming from a non-CS background and am continuously expanding my tech stack.
Blog home page: https://raelum.blog.csdn.net
Main areas: NLP, RS, GNN
If this article helps you, please consider following, liking, bookmarking, or leaving a comment — it is the biggest motivation for my writing.
1. Preface
nn.CrossEntropyLoss is commonly used as the loss function for multi-class classification problems (readers unfamiliar with cross entropy can refer to my earlier article). This article follows PyTorch's official documentation and explains the important points one by one (not every detail will be covered).
import torch
import torch.nn as nn
2. Theoretical basis
For a $C$-class problem ($C>2$), first set aside the batch dimension and let the network outputs (before Softmax) be $\{x_c\}_{c=1}^C$. After Softmax we obtain

$$q_i=\frac{\exp(x_i)}{\sum_{c=1}^C\exp(x_c)}$$
so the cross-entropy loss of this sample is

$$H(p,q)=-\sum_{i=1}^C p_i\log q_i=-\sum_{i=1}^C p_i\log\frac{\exp(x_i)}{\sum_{c=1}^C\exp(x_c)}$$

where $(p_1,p_2,\cdots,p_C)$ is a one-hot vector.
Without loss of generality, let $p_y=1$ for some $y\in\{1,2,\cdots,C\}$ and let all other components be $0$; the formula above then becomes

$$H(p,q)=-\log\frac{\exp(x_y)}{\sum_{c=1}^C\exp(x_c)}$$
Now consider the batch case. Let the batch size be $N$ and the network outputs be $\{x_{nc}\},\;n=1,\cdots,N,\;c=1,\cdots,C$. Denote the true class of the $n$-th sample by $y_n\,(y_n\in\{1,2,\cdots,C\})$ and its cross-entropy loss by $l_n$. Following the formula above,

$$l_n=-\log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^C\exp(x_{nc})}$$
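As a quick sanity check (not part of the original derivation; the tensors below are arbitrary), the per-sample formula can be verified directly against PyTorch's built-in loss:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)                       # raw logits for N=4 samples, C=3 classes
y = torch.tensor([2, 0, 1, 2])              # true classes
q = torch.softmax(x, dim=1)                 # q_{nc} = exp(x_{nc}) / sum_c exp(x_{nc})
manual = -torch.log(q[torch.arange(4), y])  # pick out q_{n, y_n} and take -log
builtin = nn.CrossEntropyLoss(reduction='none')(x, y)
print(torch.allclose(manual, builtin))      # True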
Next we discuss some special cases. When the data are imbalanced (some classes have far more samples than others), we assign each class a weight to balance the losses; denote the weights by $\boldsymbol{w}=(w_1,w_2,\cdots,w_C)$.

The model tends to overfit the class (or classes) with the most samples, so classes with few samples should be given larger weights; a wrong prediction on those classes is then penalized more heavily.
With the weights in place, the loss becomes

$$l_n=-w_{y_n}\log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^C\exp(x_{nc})}$$
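One common way to choose such weights (a sketch only; the class counts below are made up) is to weight each class by the inverse of its frequency:

import torch
import torch.nn as nn

counts = torch.tensor([900., 80., 20.])          # hypothetical counts: class 0 dominates
weight = counts.sum() / (len(counts) * counts)   # inverse-frequency weights
# weight ≈ tensor([0.3704, 4.1667, 16.6667]) -- rare classes get larger weights
criterion = nn.CrossEntropyLoss(weight=weight)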
After computing $l_1,l_2,\cdots,l_N$, we can return all of them at once (reduction=none), return their mean (reduction=mean), or return their sum (reduction=sum). Note that with weights, the "mean" is a weighted average: the sum of the losses is divided by the sum of the sample weights, not by $N$.

$$\ell=\begin{cases} (l_1,\cdots,l_N), & \text{reduction=none} \\ \sum_{n=1}^N l_n \big/ \sum_{n=1}^N w_{y_n}, & \text{reduction=mean} \\ \sum_{n=1}^N l_n, & \text{reduction=sum} \end{cases}$$
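A small sketch (arbitrary tensors and weights, not from the original post) confirming that with reduction='mean' the weighted losses are divided by $\sum_n w_{y_n}$ rather than by $N$:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)
y = torch.tensor([0, 2, 1, 0])
w = torch.tensor([0.2, 1.0, 3.0])
per_sample = nn.CrossEntropyLoss(weight=w, reduction='none')(x, y)  # already multiplied by w_{y_n}
mean_loss  = nn.CrossEntropyLoss(weight=w, reduction='mean')(x, y)
print(torch.allclose(mean_loss, per_sample.sum() / w[y].sum()))  # True
print(torch.isclose(mean_loss, per_sample.sum() / 4))            # False: dividing by N is not what 'mean' does here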
In NLP tasks we usually append padding tokens to each sequence so that sequences of different lengths can be batched together. During training we do not want the network's predictions at padding positions to contribute to the loss. Let the index of the padding token in the vocabulary be $i$; then $l_n$ is modified as

$$l_n=-w_{y_n}\cdot \mathbb{I}(y_n\neq i)\cdot\log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^C\exp(x_{nc})},\qquad \text{where}\;\; \mathbb{I}(x)=\begin{cases}1, & x\;\text{is True}\\ 0, & x\;\text{is False}\end{cases}$$
In this scenario the loss corresponding to reduction=mean becomes

$$\ell=\frac{\sum_{n=1}^N l_n}{\sum_{n=1}^N w_{y_n}\cdot \mathbb{I}(y_n\neq i)}$$
Note that in PyTorch we actually have $y_n\in\{0,1,\cdots,C-1\}$; $\{1,2,\cdots,C\}$ was used above only to keep the notation consistent with the rest of the derivation.
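Below is a small sketch of the padding scenario (the vocabulary size, padding index, and tensors are made up): positions whose label equals ignore_index contribute nothing to the loss, and the mean is taken only over the remaining positions.

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, pad_idx = 6, 0                           # assume index 0 is the padding token
logits = torch.randn(5, vocab_size)                  # 5 target positions of one (flattened) batch
labels = torch.tensor([3, 1, 4, pad_idx, pad_idx])   # last two positions are padding
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(logits, labels)                     # averaged over the 3 non-padding positions only
# Same value computed by hand:
keep = labels != pad_idx
manual = nn.CrossEntropyLoss(reduction='none')(logits, labels)[keep].mean()
print(torch.allclose(loss, manual))                  # True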
3. Main parameters
The main parameters of nn.CrossEntropyLoss are as follows:

nn.CrossEntropyLoss(weight=None, ignore_index=-100, reduction='mean', label_smoothing=0.0)
Note: the size_average and reduce parameters are deprecated and have been replaced by reduction, so they are not covered here.
With the groundwork above, these parameters are easy to understand (a combined usage sketch follows the list):
- weight: a tensor of length $C$; generally used when the data are imbalanced.
- ignore_index: index of the class to be ignored; the default is $-100$, i.e., nothing is ignored.
- reduction: decides how the loss is returned. With none the losses of all $N$ samples are returned, with mean their (weighted) average is returned, and with sum their sum is returned. The default is mean.
- label_smoothing: decides whether label smoothing is applied (readers unfamiliar with label smoothing can refer to this article); the value lies in $[0,1]$. The default is $0$, i.e., no smoothing.
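Putting the parameters together, a usage sketch (all values below are illustrative only, not recommendations):

import torch
import torch.nn as nn

num_classes = 5
class_weights = torch.ones(num_classes)  # replace with real imbalance weights if needed
criterion = nn.CrossEntropyLoss(
    weight=class_weights,    # length-C tensor for imbalanced data
    ignore_index=-100,       # default: ignore nothing
    reduction='mean',        # 'none' | 'mean' | 'sum'
    label_smoothing=0.1,     # 0.0 disables label smoothing
)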
3.1 Input and output
The inputs are input and target. input usually has shape $(N, C)$ (i.e., batch_size × num_classes), while target usually has shape $(N,)$, with each component lying in $[0, C-1]\cap\mathbb{Z}$ and representing the class of the corresponding sample.
input and target can also take other forms, but this article only discusses the most common one. Note that input is the raw output of the neural network (not passed through Softmax); nn.CrossEntropyLoss applies Softmax automatically.
torch.manual_seed(0)
batch_size = 3
num_classes = 5
criterion_1 = nn.CrossEntropyLoss(reduction='none')
criterion_2 = nn.CrossEntropyLoss()
criterion_3 = nn.CrossEntropyLoss(reduction='sum')
inputs = torch.randn(batch_size, num_classes)  # named "inputs" to avoid shadowing the built-in input (not strictly necessary)
target = torch.randint(num_classes, size=(batch_size, ))
print(criterion_1(inputs, target))  # losses of the 3 samples
# tensor([1.4639, 3.0493, 2.3056])
print(criterion_2(inputs, target))  # mean of the 3 sample losses
# tensor(2.2729)
print(criterion_3(inputs, target))  # sum of the 3 sample losses
# tensor(6.8188)
print(sum(criterion_1(inputs, target)) == criterion_3(inputs, target))
# tensor(True)
print(sum(criterion_1(inputs, target)) / batch_size == criterion_2(inputs, target))
# tensor(True)
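As an aside on the "other forms" of input mentioned above: since PyTorch 1.10, target may also be given as class probabilities (soft labels) of shape $(N, C)$ instead of class indices. A minimal sketch (the tensors are made up):

import torch
import torch.nn as nn

logits = torch.randn(2, 3)
soft_target = torch.tensor([[0.7, 0.2, 0.1],
                            [0.1, 0.1, 0.8]])      # each row is a probability distribution over the 3 classes
loss = nn.CrossEntropyLoss()(logits, soft_target)  # requires PyTorch >= 1.10
print(loss)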
4. Implementing nn.CrossEntropyLoss from scratch

To deepen our understanding, let's now implement nn.CrossEntropyLoss from scratch (it will of course differ from the official implementation; readability is prioritized over efficiency, so the code is deliberately naive).
First, set up the skeleton (for simplicity, label_smoothing is not considered):
class CrossEntropyLoss(nn.Module):
    def __init__(self, weight=None, ignore_index=-100, reduction='mean'):
        super().__init__()
        self.weight = weight
        self.ignore_index = ignore_index
        self.reduction = reduction

    def forward(self, inputs, target):
        pass
For ease of computation, we first rewrite the loss formula from Section 2 (using $\log(a/b)=\log a-\log b$):

$$l_n=w_{y_n}\cdot \mathbb{I}(y_n\neq i)\cdot\Big[-x_{n,y_n}+\log\sum_{c=1}^C\exp(x_{nc})\Big]$$

Rewriting it with indexing notation closer to Python gives

$$l_n=\boldsymbol{w}[y_n]\cdot \mathbb{I}(y_n\neq i)\cdot\Big[-\boldsymbol{x}_n[y_n]+\log\sum_{c=1}^C\exp(\boldsymbol{x}_n[c])\Big]$$

where $\boldsymbol{w}=(w_1,\cdots,w_C)$ and $\boldsymbol{x}_n=(x_{n1},\cdots,x_{nC})$. Further let $\mathbf{X}=(\boldsymbol{x}_1;\cdots;\boldsymbol{x}_N)$ and $\boldsymbol{y}=(y_1,\cdots,y_N)$; clearly $\mathbf{X}$ is our input and $\boldsymbol{y}$ is our target, so the whole batch can be computed at once:

$$(l_1,\cdots,l_N)=\boldsymbol{w}[\boldsymbol{y}]*\mathbb{I}(\boldsymbol{y}\neq i)*\big(-\mathbf{X}[\text{range}(\text{len}(\boldsymbol{y})),\,\boldsymbol{y}]+\log(\text{sum}(\exp(\mathbf{X}),\,\text{dim}=1))\big)$$

where $*$ denotes element-wise multiplication and broadcasting is used.
class CrossEntropyLoss(nn.Module):
    def __init__(self, weight=None, ignore_index=-100, reduction='mean'):
        super().__init__()
        self.weight = weight
        self.ignore_index = ignore_index
        self.reduction = reduction

    def forward(self, inputs, target):
        if self.weight is not None:
            n_samples_weight = self.weight[target]  # weight of each sample, i.e. w[y_n]
        else:
            n_samples_weight = torch.ones_like(target).float()  # all weights default to 1 if none are given
        indicator = (target != self.ignore_index).float()  # float() turns the Boolean tensor into a 0-1 tensor
        raw_loss = -inputs[torch.arange(len(target)), target] + torch.log(torch.sum(torch.exp(inputs), dim=1))
        result = n_samples_weight * indicator * raw_loss
        if self.reduction == 'mean':
            return torch.sum(result) / n_samples_weight.dot(indicator)
        elif self.reduction == 'sum':
            return torch.sum(result)
        else:
            return result
Its output is exactly the same as that of PyTorch's official nn.CrossEntropyLoss. A quick check is given below; readers are encouraged to verify further.
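A hedged verification sketch (random tensors and arbitrary weights; it assumes the CrossEntropyLoss class defined above is in scope):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(6, 4)
y = torch.tensor([0, 3, 1, 2, 2, 0])
w = torch.rand(4)
for reduction in ('none', 'mean', 'sum'):
    ours     = CrossEntropyLoss(weight=w, ignore_index=2, reduction=reduction)(x, y)
    official = nn.CrossEntropyLoss(weight=w, ignore_index=2, reduction=reduction)(x, y)
    print(torch.allclose(ours, official))  # True for each of the three reductions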