ๅฝ“ๅ‰ไฝ็ฝฎ๏ผš็ฝ‘็ซ™้ฆ–้กต>Deep learning -- recurrent neural network

Deep learning -- recurrent neural network

2022-06-30 07:44:00 ใ€Hair will grow again without itใ€‘

Recurrent neural networks

Why introduce recurrent neural networks?

How do you build a model, a neural network that learns the mapping from 𝑋 to 𝑌? One approach to try is a standard neural network. In our earlier example we had 9 input words. Imagine taking these 9 input words, perhaps as 9 one-hot vectors, and feeding them into a standard neural network; after some hidden layers, it would eventually output 9 values, each 0 or 1, indicating whether each input word is part of a person's name.
It turns out this approach does not work well, for two main reasons:

  1. Inputs and outputs can have different lengths in different examples; not every example has the same input length 𝑇𝑥 or the same output length 𝑇𝑦. Even if every sentence had some maximum length, you could pad (zero pad) every input sentence up to that maximum, but this still does not seem like a good representation.
  2. A standard network does not share features learned at different positions in the text. Specifically, if the network has learned that Harry appearing at position 1 is likely part of a person's name, it would be great if it automatically recognized Harry as part of a name when it appears elsewhere, say at 𝑥<𝑡>.
    Also, the inputs mentioned before (𝑥<1> … 𝑥<𝑡> … 𝑥<𝑇𝑥> in the figure) are all 10,000-dimensional one-hot vectors, so the input layer would be enormous. If the total input size is the maximum number of words times 10,000, the weight matrix of the first layer would have a huge number of parameters, as the quick calculation below shows.
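As a rough sanity check, here is that parameter count in code. The maximum sentence length of 10 words and the 1,000-unit first layer are hypothetical numbers chosen for illustration; only the 10,000-word vocabulary comes from the example above.

```python
vocab_size = 10_000    # each word is a 10,000-dimensional one-hot vector
max_words = 10         # hypothetical maximum sentence length (illustrative)
hidden_units = 1_000   # hypothetical width of the first hidden layer

input_size = max_words * vocab_size                      # flattened one-hot inputs
first_layer_weights = input_size * hidden_units          # weight matrix entries
print(f"first-layer weights: {first_layer_weights:,}")   # 100,000,000
```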

What is a recurrent neural network?

If you read the sentence from left to right, the first word is 𝑥<1>. What we do is feed this first word into a neural network layer, the first hidden layer of the network, and have the network try to predict an output: is this word part of a person's name? What a recurrent neural network does is this: when it reads the second word of the sentence, say 𝑥<2>, it does not predict 𝑦^<2> from 𝑥<2> alone; it also takes in some information from time step 1. Specifically, the activation value of time step 1 is passed to time step 2. Then, at the next time step, the recurrent neural network takes the word 𝑥<3> as input and tries to predict 𝑦^<3>, and so on, until the last time step, where it takes 𝑥<𝑇𝑥> as input and outputs 𝑦^<𝑇𝑦>. In this example, at least, 𝑇𝑥 = 𝑇𝑦; if 𝑇𝑥 and 𝑇𝑦 were unequal, this structure would need to change. So at every time step, the recurrent neural network passes an activation value on to the next time step's computation.
You need to construct an activation value for time zero, 𝑎<0>, which is usually a zero vector. Some researchers initialize 𝑎<0> randomly or by other methods, but a zero vector as the pseudo-activation at time zero is the most common choice, so that is what we feed into the network.
At every time step, you input 𝑥<𝑡> and output 𝑦^<𝑡>. To represent the recurrent connection, people sometimes draw a circle, meaning the output is fed back into the network layer, and sometimes a black square, indicating a delay of one time step at that point (as in the diagram on the far right of the original figure).

A recurrent neural network scans the data from left to right, and the parameters are shared across time steps. We use 𝑊ax to denote the parameters governing the connection from 𝑥<1> to the hidden layer; every time step uses the same parameters 𝑊ax. The horizontal connections between activations are governed by the parameters 𝑊aa, and every time step uses the same 𝑊aa; likewise the outputs are determined by 𝑊ya. In this recurrent neural network, this means that **when predicting 𝑦<3>, it uses not only the information from 𝑥<3> but also the information from 𝑥<1> and 𝑥<2>**, because information from 𝑥<1> can flow along this chain of activations to help predict 𝑦<3>.

A weakness of this network is that it uses only the earlier information in the sequence to make a prediction. In particular, when predicting 𝑦^<3>, it does not use the information from 𝑥<4>, 𝑥<5>, 𝑥<6>, and so on. This is a problem: given the sentence "Teddy Roosevelt was a great President.", knowing only the first two words is not enough to judge whether Teddy is part of a person's name; information from later in the sentence is also very useful, because the sentence could instead be "Teddy bears are on sale!". Given only the first three words, it is impossible to know for sure whether Teddy is part of a person's name: in the first example it is, in the second it is not, so you cannot tell the difference from the first three words alone. The solution is the bidirectional recurrent neural network.

Forward propagation in a recurrent neural network

Here is a cleaned-up schematic of the network. As mentioned before, you typically start with 𝑎<0>, a zero vector. Then comes the forward propagation: first compute the activation 𝑎<1>, then compute 𝑦^<1>: 𝑎<1> = 𝑔1(𝑊𝑎𝑎 𝑎<0> + 𝑊𝑎𝑥 𝑥<1> + 𝑏𝑎), 𝑦^<1> = 𝑔2(𝑊𝑦𝑎 𝑎<1> + 𝑏𝑦). The notation convention for the matrix subscripts is this: in 𝑊ax, the second subscript means that 𝑊ax is multiplied by some 𝑥-type quantity, and the first subscript 𝑎 means it is used to compute some 𝑎-type quantity. Similarly, 𝑊ya is multiplied by some 𝑎-type quantity and used to compute a 𝑦^-type quantity.
The activation function used in recurrent neural networks is often tanh; ReLU is sometimes used, but tanh is the more common choice. For the output, if it is a binary classification problem you would use the sigmoid function as the activation, and if it is a 𝑘-class classification problem you could choose softmax. The type of output activation depends on the type of output 𝑦: for named entity recognition, 𝑦 can only be 0 or 1, so the second activation function 𝑔2 here would be sigmoid.
More generally, at time step 𝑡:
𝑎<𝑡> = 𝑔1(𝑊𝑎𝑎 𝑎<𝑡−1> + 𝑊𝑎𝑥 𝑥<𝑡> + 𝑏𝑎)
𝑦^<𝑡> = 𝑔2(𝑊𝑦𝑎 𝑎<𝑡> + 𝑏𝑦)
These equations define the forward propagation of the network: you start from the zero vector 𝑎<0>, then use 𝑎<0> and 𝑥<1> to compute 𝑎<1> and 𝑦^<1>, then use 𝑥<2> and 𝑎<1> to compute 𝑎<2> and 𝑦^<2>, and so on, completing forward propagation from left to right as in the figure.
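Here is a minimal NumPy sketch of this forward pass. The function name rnn_forward, the shapes, and the tanh/sigmoid pairing follow the discussion above but are otherwise assumptions for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, Waa, Wax, Wya, ba, by):
    """Run the forward pass over a sequence xs = [x<1>, ..., x<Tx>] of
    column vectors. The same parameters are reused at every time step."""
    a = np.zeros((Waa.shape[0], 1))           # a<0>: the usual zero-vector start
    activations, predictions = [], []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)   # a<t> = g1(Waa a<t-1> + Wax x<t> + ba)
        y_hat = sigmoid(Wya @ a + by)         # y^<t> = g2(Wya a<t> + by), binary output
        activations.append(a)
        predictions.append(y_hat)
    return activations, predictions
```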
Next, to simplify the notation, I will take the part 𝑊𝑎𝑎 𝑎<𝑡−1> + 𝑊𝑎𝑥 𝑥<𝑡> and write it in a simpler form: 𝑎<𝑡> = 𝑔(𝑊𝑎[𝑎<𝑡−1>, 𝑥<𝑡>] + 𝑏𝑎), where the two sides are equal. We define 𝑊𝑎 by placing the matrices 𝑊𝑎𝑎 and 𝑊𝑎𝑥 side by side horizontally: [𝑊𝑎𝑎 ⋮ 𝑊𝑎𝑥] = 𝑊𝑎. For example, if 𝑎 is 100-dimensional and, continuing the earlier example, 𝑥 is 10,000-dimensional, then 𝑊𝑎𝑎 is a (100, 100) matrix and 𝑊𝑎𝑥 is a (100, 10,000) matrix; stacking the two matrices gives a 𝑊𝑎 that is a (100, 10,100) matrix.
The notation [𝑎<𝑡−1>, 𝑥<𝑡>] means stacking the two vectors on top of each other.
You can check for yourself that multiplying this matrix by this vector recovers the original quantity: the matrix [𝑊𝑎𝑎 ⋮ 𝑊𝑎𝑥] times [𝑎<𝑡−1>; 𝑥<𝑡>] equals exactly 𝑊𝑎𝑎 𝑎<𝑡−1> + 𝑊𝑎𝑥 𝑥<𝑡>, matching the earlier expression, as the check below confirms.
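A quick numerical check of this equivalence (dimensions taken from the example above; the random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 100, 10_000                     # dimensions from the example above
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))
x = rng.standard_normal((n_x, 1))

Wa = np.hstack([Waa, Wax])                 # [Waa | Wax], shape (100, 10100)
ax = np.vstack([a_prev, x])                # [a<t-1>; x<t>], shape (10100, 1)

assert np.allclose(Wa @ ax, Waa @ a_prev + Wax @ x)   # both forms agree
```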
Similarly, for 𝑦^<𝑡> = 𝑔(𝑊𝑦𝑎 𝑎<𝑡> + 𝑏𝑦), I will rewrite it in a simpler way: 𝑦^<𝑡> = 𝑔(𝑊𝑦 𝑎<𝑡> + 𝑏𝑦). Now 𝑊𝑦 and 𝑏𝑦 carry only one subscript, indicating what type of quantity the computation outputs: 𝑊𝑦 is a weight matrix for computing a 𝑦-type quantity, while 𝑊𝑎 and 𝑏𝑎 above are parameters for computing 𝑎-type quantities, i.e., activations.

Backpropagation in a recurrent neural network

Let us first review the forward propagation. You have an input sequence 𝑥<1>, 𝑥<2>, 𝑥<3>, …, 𝑥<𝑇𝑥>. You use 𝑥<1> together with 𝑎<0> to compute the activation of time step 1, then use 𝑥<2> and 𝑎<1> to compute 𝑎<2>, then 𝑎<3>, and so on, up to 𝑎<𝑇𝑥>.
To actually compute 𝑎<1>, you also need the parameters 𝑊𝑎 and 𝑏𝑎. These same parameters are used at every subsequent time step, so we keep using them to compute 𝑎<2>, 𝑎<3>, and so on; all of these activations depend on the parameters 𝑊𝑎 and 𝑏𝑎. Given 𝑎<1>, the network can compute the first prediction 𝑦^<1>, then at the next time step 𝑦^<2>, 𝑦^<3>, and so on, up to 𝑦^<𝑇𝑦>. Computing 𝑦^ requires the parameters 𝑊𝑦 and 𝑏𝑦, which are likewise used at every one of these nodes.
Then, for computing backpropagation, you need one more thing: a loss function. Let us define an element-wise loss function:
๐ฟ<๐‘ก>(๐‘ฆ^<๐‘ก> , ๐‘ฆ<๐‘ก>) = โˆ’๐‘ฆ<๐‘ก>log ๐‘ฆ^<๐‘ก> โˆ’ (1 โˆ’ ๐‘ฆ^<๐‘ก>)๐‘™๐‘œ๐‘”(1 โˆ’ ๐‘ฆ^<๐‘ก>)
This corresponds to a specific word in the sequence: if that word is part of a person's name, then 𝑦<𝑡> is 1, and the network outputs the probability that the word is part of a name, say 0.1. This is the standard logistic regression loss, also called the cross-entropy loss. It is the loss for the prediction of a single word at a single position, i.e., time step 𝑡.
Now let us define the loss for the entire sequence, 𝐿, as
𝐿(𝑦^, 𝑦) = ∑_{𝑡=1}^{𝑇𝑥} 𝐿<𝑡>(𝑦^<𝑡>, 𝑦<𝑡>)
In this diagram, from 𝑦^<1> you can compute the corresponding loss; so we compute the loss of the first time step, then the loss of the second time step, then the third, and so on until the last time step. Finally, to compute the total loss, we add them all up to get the final 𝐿, as the sketch below shows.
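A small sketch of that computation, reusing the predictions returned by the rnn_forward sketch above (binary 0/1 labels assumed; the epsilon clamp is my addition for numerical safety):

```python
import numpy as np

def element_loss(y_hat, y, eps=1e-12):
    """Cross-entropy loss L<t> for a single time step (y is 0 or 1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # keep log() away from 0
    return (-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)).item()

def sequence_loss(predictions, labels):
    """Total loss L: the sum of the per-time-step losses."""
    return sum(element_loss(y_hat, y) for y_hat, y in zip(predictions, labels))
```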
The backpropagation algorithm computes and passes information in the opposite direction. Effectively, you reverse all the forward arrows; after that you can compute all the required quantities, take derivatives with respect to the parameters, and update the parameters by gradient descent. Because the computation runs backwards through time, from right to left, this is known as backpropagation through time.
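As a rough sketch of what reversing the arrows looks like in code, here is a hand-derived backpropagation-through-time pass for the tanh/sigmoid network sketched earlier; it is an illustration under those assumptions, not a reference implementation.

```python
import numpy as np

def rnn_backward(xs, ys, activations, predictions, Waa, Wax, Wya):
    """Backpropagation through time for the tanh/sigmoid RNN sketched above.
    Returns gradients of the summed cross-entropy loss w.r.t. each parameter."""
    n_a = Waa.shape[0]
    dWaa, dWax, dWya = np.zeros_like(Waa), np.zeros_like(Wax), np.zeros_like(Wya)
    dba = np.zeros((n_a, 1))
    dby = np.zeros_like(predictions[0])
    da_next = np.zeros((n_a, 1))              # gradient arriving from step t+1

    for t in reversed(range(len(xs))):        # walk the forward arrows backwards
        a = activations[t]
        a_prev = activations[t - 1] if t > 0 else np.zeros((n_a, 1))
        dz_y = predictions[t] - ys[t]         # sigmoid + cross-entropy derivative
        dWya += dz_y @ a.T
        dby += dz_y
        da = Wya.T @ dz_y + da_next           # from the output and from step t+1
        dz_a = (1.0 - a ** 2) * da            # tanh'(z) = 1 - tanh(z)^2
        dWaa += dz_a @ a_prev.T
        dWax += dz_a @ xs[t].T
        dba += dz_a
        da_next = Waa.T @ dz_a                # pass the gradient back to step t-1
    return dWaa, dWax, dWya, dba, dby
```

A gradient descent step then moves each parameter against its gradient, for example `Waa -= learning_rate * dWaa`.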

ๅŽŸ็ฝ‘็ซ™

็‰ˆๆƒๅฃฐๆ˜Ž
ๆœฌๆ–‡ไธบ[Hair will grow again without it]ๆ‰€ๅˆ›๏ผŒ่ฝฌ่ฝฝ่ฏทๅธฆไธŠๅŽŸๆ–‡้“พๆŽฅ๏ผŒๆ„Ÿ่ฐข
https://yzsam.com/2022/181/202206300722174353.html