Deep understanding of cross entropy loss function
2022-07-08 02:17:00 【Strawberry sauce toast】
Preface
This article refers to the torch.nn.CrossEntropyLoss() documentation [1] and builds a deep understanding of the cross entropy loss from both its principle and its implementation details.
1. Cross Entropy
1.1 The definition of cross entropy
Suppose X is a discrete random variable and p(x), q(x) are two probability distributions of X. Cross entropy is defined as:
H(q,p) = -\sum_x q(x)\,\log p(x)
Cross entropy measures the similarity between two distributions: the smaller the cross entropy, the more similar p and q are. For a fixed q, H(q,p) attains its minimum when p = q.
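As a quick numerical illustration (a minimal sketch; the distributions below are made up), the code computes H(q, p) for a fixed q against two candidate distributions and shows that the cross entropy is smallest when p = q:

import torch

q = torch.tensor([0.7, 0.2, 0.1])      # reference distribution (illustrative values)
p_far = torch.tensor([0.2, 0.3, 0.5])  # a candidate far from q
p_equal = q.clone()                    # a candidate equal to q

def cross_entropy(q, p):
    # H(q, p) = -sum_x q(x) log p(x)
    return -(q * torch.log(p)).sum()

print(cross_entropy(q, p_far))         # larger value (about 1.44)
print(cross_entropy(q, p_equal))       # the minimum over p: the entropy of q (about 0.80)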
1.2 Cross entropy loss
In classification problems, the cross entropy loss (Cross Entropy Loss) is defined as:
l(y, \hat{y}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j
Here, y is the category label of the sample (a one-hot encoded vector of length q), and ŷ is the probability vector predicted by the model.
2. Maximum Likelihood Estimation
Likelihood: given the form of the population's distribution function, estimate the parameters of that distribution from the probability of the events that were actually observed. [2]
Probability: with the population's distribution function fully known, predict how likely the next event is.
2.1 Likelihood function
Suppose the population X is discrete and its probability mass function P{X = x} = p(x; θ), θ ∈ Θ, has a known form, where θ is the parameter to be estimated and Θ is the set of possible values of θ.
Suppose x_1, x_2, ..., x_n are the observed values corresponding to the sample X_1, X_2, ..., X_n. The probability that the event X_1 = x_1, X_2 = x_2, ..., X_n = x_n occurs is:
L(\theta) = L(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta)
Here, p(x_i; θ) is the probability that the event X_i = x_i occurs. L(θ), viewed as a function of θ, is called the likelihood function of the sample.
Put plainly, for ease of understanding: the likelihood function L(θ) is the probability that the event {X_1 = x_1, X_2 = x_2, ..., X_n = x_n} occurs.
2.2 Maximum likelihood estimation
Basic idea: with the sample observations x_1, x_2, ..., x_n fixed, choose within the range of possible values of θ the estimate θ̂ that maximizes the likelihood function, i.e.:
L(x_1, x_2, \ldots, x_n; \hat{\theta}) = \max_{\theta \in \Theta} L(x_1, x_2, \ldots, x_n; \theta)
The resulting estimate θ̂ depends on the sample values x_1, x_2, ..., x_n; it is written θ̂(x_1, x_2, ..., x_n) and is called the maximum likelihood estimate of the parameter θ.
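As a small worked example (a sketch with made-up observations, assuming a Bernoulli population), the code below evaluates the log likelihood over a grid of θ values and confirms that it peaks at the sample mean, which is the maximum likelihood estimate for this model:

import torch

# made-up Bernoulli observations (1 = success, 0 = failure)
x = torch.tensor([1., 0., 1., 1., 0., 1., 1., 0., 1., 0.])

# log L(theta) = sum_i [ x_i * log(theta) + (1 - x_i) * log(1 - theta) ]
theta = torch.linspace(0.01, 0.99, 99)
log_likelihood = x.sum() * torch.log(theta) + (len(x) - x.sum()) * torch.log(1 - theta)

theta_hat = theta[log_likelihood.argmax()]
print(theta_hat, x.mean())             # both are ~0.6: the maximizer coincides with the sample mean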
2.3 Maximum likelihood estimation in classification problems
Consider a K-class classification problem with n known samples (x^{(i)}, y^{(i)}). Treating the predicted class probabilities p_k^{(i)} (the probability the model assigns to class k for sample i) as the quantities to estimate, the likelihood of the observed labels is:
L\big((x^{(i)}, y^{(i)}); p\big) = \prod_{i=1}^{n} \prod_{k=1}^{K} \big(p_k^{(i)}\big)^{y_k^{(i)}}
Optimization problems are usually posed as minimization rather than maximization, so maximizing the likelihood function can be converted into minimizing the negative log likelihood function. The negative log likelihood function is:
-\log L\big((x^{(i)}, y^{(i)}); p\big) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \log p_k^{(i)}
Comparing this with the definition of the cross entropy loss in Section 1.2 shows that minimizing the cross entropy loss and minimizing the negative log likelihood are formally equivalent.
Therefore, the choice of cross entropy as the loss function of a classification model can be understood from the perspective of maximizing the sample likelihood.
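A quick numerical check (a minimal sketch with a made-up probability vector): because y is one-hot, the inner sum -∑_k y_k log p_k keeps only the log probability of the true class, which is exactly the per-sample cross entropy loss:

import torch

p = torch.tensor([0.1, 0.7, 0.2])      # predicted class probabilities (illustrative)
y = torch.tensor([0., 1., 0.])         # one-hot label: the true class is index 1

loss_full = -(y * torch.log(p)).sum()  # -sum_k y_k log p_k
loss_true_class = -torch.log(p[1])     # -log p_{true class}
print(torch.allclose(loss_full, loss_true_class))  # True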
3. Implementation of the Cross Entropy Loss Function
The torch.nn.CrossEntropyLoss() documentation mentions:
Note that this case is equivalent to the combination of LogSoftmax and NLLLoss.
The implementation of log_softmax in PyTorch has been described in detail in blog [3].
Here, the focus is on understanding the connection between negative log likelihood and cross entropy through code (see reference [4]).
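As a side note (a tiny self-contained sketch) on why log_softmax is used instead of composing log and softmax directly: the naive composition overflows for large logits, while F.log_softmax uses the log-sum-exp trick and stays finite:

import torch
import torch.nn.functional as F

x = torch.tensor([1000., 0., -1000.])
print(torch.log(torch.softmax(x, dim=0)))  # exp(1000) overflows to inf, so nan/-inf appear
print(F.log_softmax(x, dim=0))             # stable log-sum-exp form: approximately [0., -1000., -2000.]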
3.1 Negative log likelihood loss function (Negative Log Likelihood Loss)
The negative log likelihood loss function is defined as:
\mathrm{nllloss} = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log \hat{y}_i = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \operatorname{log\_softmax}(x_i)
Here, N is the number of samples, y_i is the one-hot encoded ground-truth label of sample i, ŷ_i is the probability vector output by the model for sample i, and the product is an inner product over the classes.
>>> import torch
>>> import torch.nn.functional as F
>>> import torch.nn as nn
>>> X = torch.randn(5, 5)                 # random logits: 5 samples, 5-class problem
>>> label = torch.tensor([0, 2, 3, 4, 1]) # ground-truth class of each of the 5 samples
>>> label_one_hot = F.one_hot(label).float()
>>> P = F.log_softmax(X, dim=1)           # log-probabilities (computed in a numerically stable way)
''' implement nll loss by hand '''
>>> nllloss = -torch.sum(label_one_hot * P) / label.shape[0]
>>> nllloss
tensor(1.9052)
''' use the PyTorch API to compute the negative log likelihood loss '''
>>> nllloss_1 = F.nll_loss(P, label)      # takes log-probabilities and integer labels, no one-hot needed
>>> nllloss_1
tensor(1.9052)
''' use the PyTorch API to compute the cross entropy loss '''
>>> cross_entropy_loss = F.cross_entropy(X, label)   # takes the raw logits X; log_softmax is applied internally
>>> cross_entropy_loss
tensor(1.9052)
All three implementations give the same result.
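As an aside (a short sketch continuing the session above; reduction is a standard argument of F.cross_entropy), the values above are batch means because the default is reduction='mean'; the sum or the per-sample losses can be requested instead:

>>> F.cross_entropy(X, label, reduction='sum')   # sum of the 5 per-sample losses
>>> F.cross_entropy(X, label, reduction='none')  # per-sample losses, a tensor of shape (5,)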
3.2 Implementing the CrossEntropyLoss function by hand
Finally, the self-implemented code is given:
''' custom log-softmax: normalize the model output and prevent numerical overflow '''
def log_softmax(X):
    c, _ = torch.max(X, dim=1, keepdim=True)                  # per-row maximum, subtracted for stability
    log_sum_exp = c + torch.log(torch.sum(torch.exp(X - c), dim=1, keepdim=True))
    return X - log_sum_exp

''' custom negative log likelihood loss '''
def nll_loss(P_k, label):
    label_one_hot = F.one_hot(label)
    return -torch.sum(label_one_hot * P_k) / label.shape[0]   # mean over samples (cross_entropy defaults to reduction='mean')
>>> X = torch.randn(5, 5)
>>> label = torch.tensor([0,2,3,4,1])
>>> nll_loss(log_softmax(X), label) == F.cross_entropy(X, label)  # exact equality holds here; torch.isclose is the safer floating-point comparison
tensor(True)
The final output is True, showing that PyTorch's torch.nn.CrossEntropyLoss() (and its functional form F.cross_entropy) is consistent with the custom implementation.
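For completeness, a short sketch of the class-based API from the title: nn.CrossEntropyLoss is the module form of the same criterion, so on the logits and labels above it produces the same value as F.cross_entropy:

>>> criterion = nn.CrossEntropyLoss()  # module form; applies log_softmax internally
>>> torch.isclose(criterion(X, label), F.cross_entropy(X, label))
tensor(True)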