Deep understanding of softmax
2022-07-08 02:17:00 【Strawberry sauce toast】
Preface
The code in this article is implemented with PyTorch.
1. The definition and code implementation of softmax
1.1 Definition
$$softmax(x_i) = \frac{exp(x_i)}{\sum_j^n exp(x_j)}$$
1.2 Code implementation
import torch

def softmax(X):
    """Compute softmax. The input X has shape [number of samples, output vector dimension]."""
    return torch.exp(X) / torch.sum(torch.exp(X), dim=1).reshape(-1, 1)
>>> X = torch.randn(5, 5)
>>> y = softmax(X)
>>> torch.sum(y, dim=1)
tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
2. The role of softmax
Softmax normalizes and calibrates the output of a linear layer, ensuring that the outputs are non-negative and sum to 1.
If the unnormalized outputs are used directly as probabilities, two problems arise:
- nothing constrains the outputs of the neurons to sum to 1;
- depending on the input, some outputs of the linear layer may be negative.
Both points violate the basic axioms of probability, as the sketch below illustrates.
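A minimal sketch of these two problems (the layer sizes, seed, and input below are illustrative assumptions, not from the original post): the raw outputs of a linear layer can be negative and do not sum to 1, while the softmax output does.

import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(4, 3)              # hypothetical linear layer: 4 features -> 3 class scores
logits = linear(torch.randn(1, 4))    # raw, unnormalized outputs
print(logits)                         # may contain negative values
print(logits.sum(dim=1))              # generally not equal to 1
probs = torch.softmax(logits, dim=1)
print(probs, probs.sum(dim=1))        # non-negative and sums to 1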
3. Overflow and underflow in softmax
3.1 Overflow
When the value of $x_i$ is too large, $exp(x_i)$ becomes enormous; once it exceeds the representable range of the floating-point type, it overflows:
>>> torch.exp(torch.tensor([1000.]))
tensor([inf])
3.2 Underflow
When every element $x_i$ of the vector $\boldsymbol x$ is a negative number with a large absolute value, $exp(x_i)$ is so small that it underflows to 0, and the denominator $\sum_j exp(x_j)$ becomes 0:
>>> X = torch.ones(1, 3) * (-1000)
>>> softmax(X)
tensor([[nan, nan, nan]])
3.3 Avoiding overflow
The trick from reference [1]:
- Find the maximum of the vector $\boldsymbol x$: $c = max(\boldsymbol x)$;
- subtract $c$ from each $x_i$, which is equivalent to dividing both the numerator and the denominator of softmax by $exp(c)$:
$$softmax(x_i - c) = \frac{exp(x_i - c)}{\sum_j^n exp(x_j - c)} = \frac{exp(x_i)exp(-c)}{\sum_j^n exp(x_j)exp(-c)} = softmax(x_i)$$
After this transformation, the largest term in the numerator becomes $exp(0) = 1$, which avoids overflow;
and the denominator is at least $1$ (it always contains the term $exp(0) = 1$), so it cannot underflow to 0:
$$\sum_j^n exp(x_j - c) = exp(x_1 - c) + exp(x_2 - c) + ... + exp(x_{max} - c) = exp(x_1 - c) + exp(x_2 - c) + ... + 1$$
def softmax_trick(X):
    c, _ = torch.max(X, dim=1, keepdim=True)    # per-row maximum
    return torch.exp(X - c) / torch.sum(torch.exp(X - c), dim=1).reshape(-1, 1)
>>> X = torch.tensor([[-1000., 1000., -1000.]])
>>> softmax_trick(X)
tensor([[0., 1., 0.]])
>>> softmax(X)
tensor([[0., nan, 0.]])
PyTorch's implementation already guards against overflow, so its result agrees with softmax_trick:
>>> import torch.nn.functional as F
>>> X = torch.tensor([[-1000., 1000., -1000.]])
>>> F.softmax(X, dim=1)
tensor([[0., 1., 0.]])
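As a quick sanity check (a sketch reusing the softmax_trick defined above on an arbitrary random input), the hand-rolled stable version also matches PyTorch's built-in softmax on ordinary inputs:

>>> X = torch.randn(4, 10)
>>> torch.allclose(softmax_trick(X), F.softmax(X, dim=1))
True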
3.4 Log-Sum-Exp trick [2] (taking the logarithm)
1. Avoiding underflow
Taking the logarithm turns multiplication into addition: $log(x_1 x_2) = log(x_1) + log(x_2)$. When two very small numbers $x_1$ and $x_2$ are multiplied, the product becomes even smaller and may underflow beyond the precision range; in log space the product becomes a sum, which reduces the risk of underflow.
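A minimal numerical illustration (the value 1e-200 is an assumption chosen only to trigger double-precision underflow, not a value from the original post):

import math

p1, p2 = 1e-200, 1e-200               # two very small probabilities (illustrative values)
print(p1 * p2)                        # 0.0 -> the product underflows in double precision
print(math.log(p1) + math.log(p2))    # ~ -921.03 -> easily representable in log space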
2. Avoiding overflow
l o g − s o f t m a x log-softmax log−softmax The definition of :
$$\begin{aligned} \text{log-softmax}(x_i) &= log[softmax(x_i)] \\ &= log\left(\frac{exp(x_i)}{\sum_j^n exp(x_j)}\right) \\ &= x_i - log\left[\sum_j^n exp(x_j)\right] \end{aligned}$$
Let $y = log\sum_j^n exp(x_j)$. When the values $x_j$ are too large, $y$ risks overflowing, so apply the same trick as in Section 3.3:
$$\begin{aligned} y &= log\sum_j^n exp(x_j) \\ &= log\sum_j^n exp(x_j - c)exp(c) \\ &= c + log\sum_j^n exp(x_j - c) \end{aligned}$$
Taking $c = max(\boldsymbol x)$ avoids the overflow, as the quick check below shows.
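A quick check of this step (a sketch; the input values are arbitrary and chosen only to overflow the naive form). PyTorch's built-in torch.logsumexp can serve as a reference:

>>> x = torch.tensor([1000., 1001., 1002.])
>>> torch.log(torch.sum(torch.exp(x)))            # naive form: exp overflows
tensor(inf)
>>> c = torch.max(x)
>>> c + torch.log(torch.sum(torch.exp(x - c)))    # stabilized form from the derivation above
tensor(1002.4076)
>>> torch.logsumexp(x, dim=0)
tensor(1002.4076)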
The log-softmax formula then becomes (in fact, this is just the trick from Section 3.3 with a logarithm applied):
$$\text{log-softmax}(x_i) = (x_i - c) - log\sum_j^n exp(x_j - c)$$
Code implementation:
def log_softmax(X):
    c, _ = torch.max(X, dim=1, keepdim=True)    # per-row maximum
    return X - c - torch.log(torch.sum(torch.exp(X - c), dim=1, keepdim=True))
>>> X = torch.tensor([[-1000., 1000., -1000.]])
>>> torch.exp(log_softmax(X))
tensor([[0., 1., 0.]])
# the same result with the PyTorch API
>>> torch.exp(F.log_softmax(X, dim=1))
tensor([[0., 1., 0.]])
3.5 The difference between log-softmax and softmax [3]
Combining the trick from Section 3.3 with my own understanding:
- In PyTorch's implementation, the result of softmax equals the result of log_softmax after exponentiation:
>>> X = torch.tensor([[-1000., 1000., -1000.]])
>>> torch.exp(F.log_softmax(X, dim=1)) == F.softmax(X, dim=1)
tensor([[True, True, True]])
- Taking the logarithm makes the derivative simpler, which can speed up backpropagation [4]:
$$\begin{aligned} \frac{\partial}{\partial x_i}\text{log-softmax}(x_i) &= \frac{\partial}{\partial x_i}\left[x_i - log\sum_j^n exp(x_j)\right] \\ &= 1 - softmax(x_i) \end{aligned}$$
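As a sanity check of this derivative (a sketch; the input, the index, and the use of autograd are my own assumptions, and only the diagonal entry of the Jacobian is verified), autograd should reproduce $1 - softmax(x_i)$:

>>> x = torch.randn(5, requires_grad=True)
>>> i = 2
>>> F.log_softmax(x, dim=0)[i].backward()
>>> torch.allclose(x.grad[i], 1 - F.softmax(x, dim=0)[i].detach())
True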