Recurrent Neural Networks (RNN)
2022-07-28 06:14:00 【Jiyu Wangchuan】
1. Recurrent neural networks
A recurrent neural network (RNN) is a class of neural network designed for processing sequence data.
1.1 A brief comparison of CNN and RNN
CNN: features are extracted with convolution kernels and then fed into subsequent layers (for example, fully connected Dense layers) for classification, object detection, and similar tasks. A CNN extracts information along the spatial dimension, and the convolution-kernel parameters are shared across space.
RNN: features are extracted with a recurrent cell and then fed into subsequent layers (for example, fully connected Dense layers) for prediction and similar tasks. An RNN extracts information along the time dimension, and the recurrent-cell parameters are shared across time.
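To make the contrast concrete, here is a minimal sketch in tf.keras; the layer sizes (8 filters, 3 memory units, 5 input features) are chosen arbitrarily for illustration. Because of parameter sharing, the Conv2D parameter count does not depend on the image size, and the SimpleRNN parameter count does not depend on the sequence length.

```python
import tensorflow as tf

# Spatial sharing: the Conv2D parameter count depends only on the kernel
# and channel sizes, not on the image height/width.
for size in (32, 64):
    conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3)
    conv(tf.zeros((1, size, size, 3)))   # build the layer by calling it on a dummy image
    print(size, conv.count_params())     # 8*3*3*3 + 8 = 224 in both cases

# Temporal sharing: the SimpleRNN parameter count depends only on the number
# of memory units and input features, not on the number of time steps.
for steps in (1, 10):
    rnn = tf.keras.layers.SimpleRNN(3)
    rnn(tf.zeros((1, steps, 5)))         # build the layer by calling it on a dummy sequence
    print(steps, rnn.count_params())     # 5*3 + 3*3 + 3 = 27 in both cases
```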
1.2 The recurrent cell
The recurrent cell has memory. Its parameters are shared across time steps, which is what allows it to extract information from a time sequence. Each recurrent cell contains a number of memory units, corresponding to the small cylinders in the figure below.
The memory units store the state information of each time step, $h_t$:

$$h_t = \tanh(x_t w_{xh} + h_{t-1} w_{hh} + b_h)$$

where $w_{xh}$ and $w_{hh}$ are weight matrices, $b_h$ is a bias, $x_t$ is the input feature at the current time step, $h_{t-1}$ is the state stored in the memory units at the previous time step, and $\tanh$ is the activation function.

The output feature of the recurrent cell at the current time step is

$$y_t = \mathrm{softmax}(h_t w_{hy} + b_y)$$

where $w_{hy}$ is a weight matrix, $b_y$ is a bias, and softmax is the activation function; this is effectively a fully connected layer. We can set the number of memory units to change the memory capacity. Once the number of memory units and the dimensions of the input $x_t$ and the output $y_t$ are specified, the dimensions of all trainable parameters are fixed as well.
In forward propagation, the state $h_t$ stored in the memory units is refreshed at every time step, while the three parameter matrices $w_{xh}$, $w_{hh}$, $w_{hy}$ and the two bias terms $b_h$, $b_y$ stay fixed throughout. In back propagation, these three parameter matrices and two bias terms are updated by gradient descent.
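As a sanity check on these two formulas, here is a minimal NumPy sketch of a single recurrent-cell step; the sizes (5 input features, 3 memory units, 5 output classes) and the random weights are made up purely for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up shapes: 5 input features, 3 memory units, 5 output classes.
rng = np.random.default_rng(0)
w_xh = rng.normal(size=(5, 3))   # input-to-hidden weights
w_hh = rng.normal(size=(3, 3))   # hidden-to-hidden weights
w_hy = rng.normal(size=(3, 5))   # hidden-to-output weights
b_h = np.zeros(3)
b_y = np.zeros(5)

x_t = np.array([0., 1., 0., 0., 0.])   # current input (one-hot "b")
h_prev = np.zeros(3)                   # previous state h_{t-1}

h_t = np.tanh(x_t @ w_xh + h_prev @ w_hh + b_h)   # refresh the memory state
y_t = softmax(h_t @ w_hy + b_y)                   # output of the cell
print(h_t, y_t)
```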
1.3 Unrolling the recurrent cell over time steps
Unrolling the recurrent cell over time steps means expanding it along the time axis, which gives the form shown in the figure below. The state information $h_t$ in the memory units is refreshed at every time step, while the parameter matrices and the two bias terms around the memory units stay fixed; those parameter matrices are exactly what training optimizes. After training, the best parameter matrices are used in forward propagation to output the prediction.
This is in fact consistent with how we humans make predictions: the memory in our brains is updated at every moment based on the current input, and the current reasoning is based on the "parameter matrices" accumulated from previous knowledge.
In short, a recurrent neural network uses the recurrent cell to extract temporal features and feeds them into a fully connected network, thereby making predictions on sequential data.
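The sketch below (again with made-up weights and a toy 5-letter sequence) unrolls this process over a few time steps: the state h is refreshed at every step while the weight matrices never change.

```python
import numpy as np

rng = np.random.default_rng(0)
w_xh, w_hh = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
b_h = np.zeros(3)

xs = np.eye(5)    # toy sequence: the 5 one-hot letters, one per time step
h = np.zeros(3)   # initial memory state

for t, x_t in enumerate(xs):
    # The same w_xh, w_hh, b_h are used at every step; only h is refreshed.
    h = np.tanh(x_t @ w_xh + h @ w_hh + b_h)
    print(f"step {t}: h = {h.round(2)}")
```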
1.4 Recurrent layers: stacking in the output direction
In an RNN, each recurrent cell forms one recurrent layer, and the number of recurrent layers grows in the direction of the output. As shown in the figure below, the network on the left has one recurrent cell and forms one recurrent layer; the network in the middle has two recurrent cells and forms two recurrent layers; the network on the right has three recurrent cells and forms three recurrent layers. The number of memory units in each recurrent cell of the three networks can be set arbitrarily according to our needs.
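A minimal tf.keras sketch of such stacking (the unit counts and the input shape are arbitrary): every recurrent layer except the last one must pass its output at every time step to the next recurrent layer, i.e. return_sequences=True, as explained in section 2 below.

```python
import tensorflow as tf

# Three stacked recurrent layers; the first two return the full sequence of h_t
# so the next recurrent layer still receives a (batch, steps, features) input.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4, 5)),                  # 4 time steps, 5 features per step
    tf.keras.layers.SimpleRNN(8, return_sequences=True),
    tf.keras.layers.SimpleRNN(8, return_sequences=True),
    tf.keras.layers.SimpleRNN(8),                  # last recurrent layer keeps only the final h_t
    tf.keras.layers.Dense(5, activation='softmax')
])
model.summary()
```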
1.5 Training an RNN
After obtaining the forward-propagation result of the RNN, we define a loss function and train the model with back propagation and gradient descent, just as with other neural networks.
The only difference for an RNN is that its node at every time step may produce an output, so the total loss of an RNN is the sum of the losses over all time steps (or over a subset of them).
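A minimal sketch of that per-time-step loss, assuming a toy setup where the network outputs a prediction at every time step (return_sequences=True) and each time step has its own integer label; all shapes here are made up.

```python
import tensorflow as tf

# Toy batch: 2 sequences, 4 time steps, 5 input features; one label per time step.
x = tf.random.normal((2, 4, 5))
y = tf.random.uniform((2, 4), maxval=5, dtype=tf.int32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4, 5)),
    tf.keras.layers.SimpleRNN(3, return_sequences=True),
    tf.keras.layers.Dense(5, activation='softmax')   # a prediction at every time step
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
loss = loss_fn(y, model(x))   # accumulated over all time steps (Keras reports the mean)
print(float(loss))
```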
2. Describing a recurrent layer in TensorFlow
tf.keras.layers.SimpleRNN(units, activation='activation function', return_sequences=whether to pass every h_t to the next layer)
(1) units: the number of neurons, i.e. the number of memory units in the recurrent cell.
(2) return_sequences: whether to return the output h_t of every time step in the sequence or only the output of the last time step. False returns only the last time step; True returns all time steps. It is usually True when the next layer is another RNN layer, and usually False when the next layer is a Dense layer.
(3) Input dimension: a three-dimensional tensor (number of samples fed in, number of time steps over which the recurrent cell is unrolled, number of input features per time step).
As shown in the figure above (Figure 1.2.6: RNN layer input dimensions), the left diagram feeds two groups of data into the RNN layer; each group produces its output after one time step, with three values fed in at each time step, so the input to the recurrent layer has dimension [2, 1, 3].
The right diagram feeds in only one group of data, sent to the recurrent layer over four time steps with two values per time step, so the input to the recurrent layer has dimension [1, 4, 2].
(4) Output dimension:
- return_sequences=True: a three-dimensional tensor (number of samples fed in, number of time steps over which the recurrent cell is unrolled, number of neurons in this layer)
- return_sequences=False: a two-dimensional tensor (number of samples fed in, number of neurons in this layer)
(5) activation: the activation function, given as a string (tanh is used by default if not specified)
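As a quick check of these input and output shapes, here is a minimal sketch; the input shape [1, 4, 2] matches the right-hand example above, and the 3 units are arbitrary.

```python
import tensorflow as tf

x = tf.zeros((1, 4, 2))   # 1 sample, 4 time steps, 2 input features per step

rnn_all = tf.keras.layers.SimpleRNN(3, return_sequences=True)
rnn_last = tf.keras.layers.SimpleRNN(3, return_sequences=False)

print(rnn_all(x).shape)    # (1, 4, 3): h_t for every time step
print(rnn_last(x).shape)   # (1, 3): only the final h_t
```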
3. Example of the recurrent computation process
The most typical application of an RNN is to use historical data to predict what happens at the next moment, that is, to make predictions from the patterns seen so far. Let us walk through a simple letter-prediction example to get a feel for the computation of a recurrent network: input one letter and predict the next one, so inputting a predicts b, b predicts c, c predicts d, d predicts e, and e predicts a. A computer cannot recognize letters and can only process numbers, so the letters need to be encoded. Here we assume one-hot encoding is used (other encodings are possible in practice); the encoding is shown in the table below.
| One-hot code | Letter |
|---|---|
| 10000 | a |
| 01000 | b |
| 00100 | c |
| 00010 | d |
| 00001 | e |
Suppose we use a one-layer RNN with the number of memory units set to 3; the letter-prediction network is shown in the figure below.
Suppose the letter b is input, so the input $x_t$ is $[0,1,0,0,0]$, and the state information $h_{t-1}$ stored from the previous time step is 0. From the formulas above it is straightforward to get:

$$h_t = \tanh(x_t w_{xh} + h_{t-1} w_{hh} + b_h) = \tanh([-2.3, 0.8, 1.1] + 0 + [0.5, 0.3, -0.2]) = \tanh([-1.8, 1.1, 0.9]) = [-0.9, 0.8, 0.7]$$

This step can be understood as the memory in the brain being updated by the current input.
The output $y_t$ is produced by a fully connected layer that turns the extracted temporal information into a prediction; it is the output layer of the whole network.

$$y_t = \mathrm{softmax}(h_t w_{hy} + b_y) = \mathrm{softmax}([-0.7, -0.6, 2.9, 0.7, -0.8] + [0.0, 0.1, 0.4, -0.7, 0.1]) = \mathrm{softmax}([-0.7, -0.5, 3.3, 0.0, -0.7]) = [0.02, 0.02, 0.91, 0.03, 0.02]$$

The model therefore assigns a 91% probability to the letter c, so the recurrent network outputs c as its prediction.
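The final softmax step is easy to verify with a short NumPy sketch; only the vector $h_t w_{hy} + b_y$ from the example above is used, and the full weight matrices are not reproduced here.

```python
import numpy as np

logits = np.array([-0.7, -0.6, 2.9, 0.7, -0.8]) + np.array([0.0, 0.1, 0.4, -0.7, 0.1])
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(2))   # [0.02 0.02 0.91 0.03 0.02] -> index 2, i.e. the letter c
```

The full TensorFlow implementation of this letter-prediction example follows.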
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, SimpleRNN
import matplotlib.pyplot as plt
import os
input_word = "abcde"
w_to_id = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}  # dictionary mapping each letter to a numeric id
id_to_onehot = {0: [1., 0., 0., 0., 0.], 1: [0., 1., 0., 0., 0.], 2: [0., 0., 1., 0., 0.],
                3: [0., 0., 0., 1., 0.], 4: [0., 0., 0., 0., 1.]}  # id to one-hot encoding
x_train = [id_to_onehot[w_to_id['a']], id_to_onehot[w_to_id['b']], id_to_onehot[w_to_id['c']],
id_to_onehot[w_to_id['d']], id_to_onehot[w_to_id['e']]]
y_train = [w_to_id['b'], w_to_id['c'], w_to_id['d'], w_to_id['e'], w_to_id['a']]
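# Shuffle x_train and y_train with the same random seed so inputs stay paired with their labels.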
np.random.seed(7)
np.random.shuffle(x_train)
np.random.seed(7)
np.random.shuffle(y_train)
tf.random.set_seed(7)
# Reshape x_train to match the SimpleRNN input requirement: [number of samples, time steps, features per step].
# The whole dataset is fed in at once, so the number of samples is len(x_train); one letter is input to produce one result, so the number of time steps is 1; the one-hot code has 5 input features, so the number of features per time step is 5.
x_train = np.reshape(x_train, (len(x_train), 1, 5))
y_train = np.array(y_train)
model = tf.keras.Sequential([
SimpleRNN(3),
Dense(5, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
metrics=['sparse_categorical_accuracy'])
checkpoint_save_path = "./checkpoint/rnn_onehot_1pre1.ckpt"
if os.path.exists(checkpoint_save_path + '.index'):
    print('-------------load the model-----------------')
    model.load_weights(checkpoint_save_path)

cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_save_path,
                                                 save_weights_only=True,
                                                 save_best_only=True,
                                                 monitor='loss')  # fit() is given no validation set, so no validation accuracy is computed; monitor the loss and save the best model
history = model.fit(x_train, y_train, batch_size=32, epochs=100, callbacks=[cp_callback])
model.summary()
# print(model.trainable_variables)
file = open('./weights.txt', 'w')  # export the trained parameters to a text file
for v in model.trainable_variables:
    file.write(str(v.name) + '\n')
    file.write(str(v.shape) + '\n')
    file.write(str(v.numpy()) + '\n')
file.close()
############################################### show ###############################################
# Plot the training-set acc and loss curves
acc = history.history['sparse_categorical_accuracy']
loss = history.history['loss']
plt.subplot(1, 2, 1)
plt.plot(acc, label='Training Accuracy')
plt.title('Training Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(loss, label='Training Loss')
plt.title('Training Loss')
plt.legend()
plt.show()
############### predict #############
preNum = int(input("input the number of test alphabet:"))
for i in range(preNum):
    alphabet1 = input("input test alphabet:")
    alphabet = [id_to_onehot[w_to_id[alphabet1]]]
    # Reshape alphabet to match the SimpleRNN input requirement: [number of samples, time steps, features per step].
    # One sample is fed in for verification, so the number of samples is 1; one letter is input to produce one result, so the number of time steps is 1; the one-hot code has 5 input features, so the number of features per time step is 5.
    alphabet = np.reshape(alphabet, (1, 1, 5))
    result = model.predict([alphabet])
    pred = tf.argmax(result, axis=1)
    pred = int(pred)
    tf.print(alphabet1 + '->' + input_word[pred])
```
The output of a run is as follows:

```
Epoch 100/100
5/5 [==============================] - 0s 34ms/sample - loss: 0.0400 - sparse_categorical_accuracy: 1.0000
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn (SimpleRNN) multiple 27
_________________________________________________________________
dense (Dense) multiple 20
=================================================================
Total params: 47
Trainable params: 47
Non-trainable params: 0
_________________________________________________________________
input the number of test alphabet:>? 5
input test alphabet:>? a
a->b
```