
4. Cross entropy

2022-07-08 01:02:00 booze-J

Cross entropy (cross-entropy)

1. Quadratic cost function (quadratic cost)

$$C = \frac{1}{2n}\sum_x \|\, y(x) - a(x) \,\|^2$$
where $C$ is the cost function, $x$ denotes a sample, $y$ the target value, $a$ the output value, and $n$ the total number of samples. For simplicity, consider a single sample; the quadratic cost function then becomes:
$$C = \frac{(y-a)^2}{2}, \qquad a = \sigma(z), \qquad z = wx + b$$
Suppose we use gradient descent to adjust the weight parameters. The gradients of the cost with respect to a weight $w$ and a bias $b$ are:
$$\frac{\partial C}{\partial w} = (a-y)\,\sigma'(z)\,x, \qquad \frac{\partial C}{\partial b} = (a-y)\,\sigma'(z)$$
where $z$ is the weighted input to the neuron and $\sigma$ is the activation function. The gradients of $w$ and $b$ are proportional to the slope of the activation function: the larger that slope, the faster $w$ and $b$ are adjusted and the faster training converges. Suppose our activation function is the sigmoid:
[Figure: the sigmoid curve $\sigma(z) = 1/(1+e^{-z})$, with point 1 marked at output 0.82 and point 2 at output 0.98]
Suppose our goal is to converge to 1. Point 1, with output 0.82, is far from the target; the slope there is large, so the weights receive a large adjustment. Point 2, with output 0.98, is close to the target; the slope is small, so the adjustment is small. This behaviour is reasonable.
If instead our goal is to converge to 0, point 1 at 0.82 is relatively close to the target, yet its slope is large and the weight adjustment is large, while point 2 at 0.98 is far from the target, yet its slope is small and the adjustment is small. This behaviour is unreasonable.
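
To make this concrete, here is a small NumPy sketch (my addition, not from the original post) that evaluates the sigmoid slope $\sigma'(z) = a(1-a)$ at the two outputs 0.82 and 0.98, together with the resulting quadratic-cost gradient $(a-y)\,\sigma'(z)$ for both targets:

import numpy as np

# Slope of the sigmoid expressed through its output a: sigma'(z) = a * (1 - a)
def sigmoid_slope(a):
    return a * (1.0 - a)

for a in (0.82, 0.98):
    slope = sigmoid_slope(a)
    for y in (1.0, 0.0):                 # target 1 (reasonable case) and target 0 (unreasonable case)
        grad = (a - y) * slope           # quadratic-cost gradient with respect to the bias b
        print("a=%.2f  target=%.0f  slope=%.4f  gradient=%+.4f" % (a, y, slope, grad))

At a = 0.98 the slope is only about 0.02, so the gradient stays tiny even when the target is 0 and the output is badly wrong, which is exactly the unreasonable behaviour described above.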

2. Cross entropy cost function (cross-entropy)

Another approach is to keep the activation function and change the cost function instead, replacing it with the cross-entropy cost function:

$$C = -\frac{1}{n}\sum_x \big[\, y \ln a + (1-y)\ln(1-a) \,\big]$$
where $C$ is the cost function, $x$ denotes a sample, $y$ the target value, $a$ the output value, and $n$ the total number of samples.

Differentiating with respect to the weights and the bias gives
$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x x_j\,\big(\sigma(z) - y\big), \qquad \frac{\partial C}{\partial b} = \frac{1}{n}\sum_x \big(\sigma(z) - y\big)$$
The $\sigma'(z)$ factor has cancelled out: the gradient is driven purely by the error $\sigma(z) - y$, so the larger the error, the faster the learning.

If the output neurons are linear, the quadratic cost function is a suitable choice. If the output neurons are sigmoid (S-shaped), the cross-entropy cost function is the more suitable choice.
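
The following sketch (my own illustration, with assumed input and weighted-input values) compares the two gradient formulas for a single sigmoid neuron. The quadratic cost keeps the $\sigma'(z)$ factor, while the cross-entropy gradient is simply the error $(a-y)$ times the input, so it stays large whenever the output is far from the target:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.0                 # assumed input and target for illustration
for z in (1.5, 3.9):            # weighted inputs giving outputs of roughly 0.82 and 0.98
    a = sigmoid(z)
    quad_grad = (a - y) * a * (1 - a) * x   # quadratic cost: gradient contains sigma'(z) = a(1-a)
    ce_grad   = (a - y) * x                 # cross-entropy cost: gradient is just the error times the input
    print("a=%.2f  quadratic=%.4f  cross-entropy=%.4f" % (a, quad_grad, ce_grad))

As the output saturates near 0.98, the quadratic-cost gradient collapses because $\sigma'(z)$ is tiny, while the cross-entropy gradient remains proportional to the error.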

3. Log-likelihood cost function (log-likelihood cost)

The log-likelihood cost function is commonly used as the cost function for softmax regression. If the output-layer neurons are sigmoid, the cross-entropy cost function can be used; the more common practice in deep learning is to make softmax the last layer, in which case the usual cost function is the log-likelihood cost.
The combination of the log-likelihood cost with softmax is very similar to the combination of cross-entropy with the sigmoid function. In the binary case, the log-likelihood cost function reduces to the cross-entropy cost function.
In TensorFlow, tf.nn.sigmoid_cross_entropy_with_logits() computes the cross-entropy paired with sigmoid, and tf.nn.softmax_cross_entropy_with_logits() computes the cross-entropy paired with softmax.
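
A minimal usage sketch (my addition, assuming TensorFlow 2.x); both functions take raw logits, i.e. the layer output before sigmoid/softmax is applied:

import tensorflow as tf

# sigmoid + cross-entropy: each logit is an independent binary prediction
logits = tf.constant([[1.2, -0.7]])
labels = tf.constant([[1.0, 0.0]])
print(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

# softmax + cross-entropy: one-hot labels over mutually exclusive classes
logits = tf.constant([[2.0, 1.0, 0.1]])
onehot = tf.constant([[1.0, 0.0, 0.0]])
print(tf.nn.softmax_cross_entropy_with_logits(labels=onehot, logits=logits))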

Simple usage

To apply this to the code from "3. MNIST dataset classification", only one line needs to change. In the training step of section 3, change

# Define the optimizer, the loss function, and the accuracy metric computed during training
model.compile(
    optimizer=sgd,
    loss="mse",
    metrics=['accuracy']
)

to:

# Define the optimizer, the loss function, and the accuracy metric computed during training
model.compile(
    optimizer=sgd,
    loss="categorical_crossentropy",
    metrics=['accuracy']
)

Then run the whole script:
[Figure: training log after switching to the cross-entropy loss]
Compared with the results of "3. MNIST dataset classification", the model trained with cross-entropy as the loss function converges faster and achieves better accuracy.
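
If you want to reproduce the comparison yourself, one possible sketch (my own, assuming x_train and y_train have already been preprocessed as in the complete code below) trains two otherwise identical models and prints the per-epoch accuracy returned by model.fit():

from keras.models import Sequential
from keras.layers import Dense
from tensorflow.keras.optimizers import SGD

def run(loss_name):
    model = Sequential([Dense(units=10, input_dim=784, bias_initializer='one', activation='softmax')])
    model.compile(optimizer=SGD(lr=0.2), loss=loss_name, metrics=['accuracy'])
    history = model.fit(x_train, y_train, batch_size=32, epochs=10, verbose=0)
    return history.history['accuracy']    # key may be 'acc' in older Keras versions

for loss_name in ('mse', 'categorical_crossentropy'):
    print(loss_name, ['%.4f' % v for v in run(loss_name)])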

Complete code

The code runs in jupyter-notebook. The code blocks in this article are split in the same order as the notebook cells, so to run the code you can paste them directly into jupyter-notebook.
1. Import third-party libraries

import numpy as np
from keras.datasets import mnist
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense
from tensorflow.keras.optimizers import SGD

2. Load and preprocess the data

#  Load data 
(x_train,y_train),(x_test,y_test) = mnist.load_data()
# (60000, 28, 28)
print("x_shape:\n",x_train.shape)
# (60000,)  labels are not yet one-hot encoded; we convert them ourselves below
print("y_shape:\n",y_train.shape)
# (60000, 28, 28) -> (60000, 784); passing -1 to reshape() lets that dimension be computed automatically; divide by 255.0 to normalize
x_train = x_train.reshape(x_train.shape[0],-1)/255.0
x_test = x_test.reshape(x_test.shape[0],-1)/255.0
# convert the labels to one-hot format
y_train = np_utils.to_categorical(y_train,num_classes=10)
y_test = np_utils.to_categorical(y_test,num_classes=10)
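
A quick, optional sanity check (my addition) confirming that preprocessing produced flattened, normalized inputs and one-hot labels:

print(x_train.shape, x_test.shape)     # expected: (60000, 784) (10000, 784)
print(x_train.min(), x_train.max())    # expected: 0.0 1.0
print(y_train.shape)                   # expected: (60000, 10)
print(y_train[0])                      # one-hot vector for the first training label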

3. Train the model

# Create the model: 784 input neurons, 10 output neurons
model = Sequential([
        # output dimension 10, input dimension 784, biases initialized to one, softmax activation
        Dense(units=10,input_dim=784,bias_initializer='one',activation="softmax"),
])
#  Define optimizer 
sgd = SGD(lr=0.2)

# Define the optimizer, the loss function, and the accuracy metric computed during training
model.compile(
    optimizer=sgd,
    loss="categorical_crossentropy",
    metrics=['accuracy']
)
# Train the model
model.fit(x_train,y_train,batch_size=32,epochs=10)

# Evaluate the model
loss,accuracy = model.evaluate(x_test,y_test)

print("\ntest loss",loss)
print("accuracy:",accuracy)

Copyright notice
This article was written by [booze-J]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/189/202207072310361901.html