ๅฝ“ๅ‰ไฝ็ฝฎ๏ผš็ฝ‘็ซ™้ฆ–้กต>Deep learning -- recurrent neural network

Deep learning -- recurrent neural network

2022-06-30 07:44:00 ใ€Hair will grow again without itใ€‘

Recurrent neural networks

Why introduce recurrent neural networks?

How do you build a model, a neural network that learns the mapping from 𝑋 to 𝑌? One approach to try is a standard neural network. In our earlier example we had 9 input words. Imagine taking these 9 input words, perhaps as 9 one-hot vectors, and feeding them into a standard neural network; after some hidden layers, it would eventually output 9 values, each 0 or 1, indicating whether each input word is part of a person's name.
It turns out this approach does not work well, for two main reasons:

  1. Inputs and outputs can have different lengths in different examples; not every example has the same input length 𝑇𝑥 or the same output length 𝑇𝑦. Even if every sentence had some maximum length, you could pad (zero pad) every input sentence up to that maximum, but this still does not seem like a good representation.
  2. A standard network does not share features learned at different positions in the text. Specifically, if the network has learned that Harry appearing at position 1 is likely part of a person's name, it would be great if it automatically recognized Harry as part of a name when it appears elsewhere, say at 𝑥<𝑡>.
    Also, the inputs mentioned before (𝑥<1> … 𝑥<𝑡> … 𝑥<𝑇𝑥> in the figure) are all 10,000-dimensional one-hot vectors, so the input layer would be enormous. If the total input size is the maximum number of words times 10,000, the weight matrix of the first layer would have a huge number of parameters, as the quick calculation below shows.
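As a rough sanity check, here is that parameter count in code. The maximum sentence length of 10 words and the 1,000-unit first layer are hypothetical numbers chosen for illustration; only the 10,000-word vocabulary comes from the example above.

```python
vocab_size = 10_000    # each word is a 10,000-dimensional one-hot vector
max_words = 10         # hypothetical maximum sentence length (illustrative)
hidden_units = 1_000   # hypothetical width of the first hidden layer

input_size = max_words * vocab_size                      # flattened one-hot inputs
first_layer_weights = input_size * hidden_units          # weight matrix entries
print(f"first-layer weights: {first_layer_weights:,}")   # 100,000,000
```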

What is a recurrent neural network?

If you read the sentence from left to right, the first word is 𝑥<1>. What we do is feed this first word into a neural network layer, the first hidden layer of the network, and have the network try to predict an output: is this word part of a person's name? What a recurrent neural network does is this: when it reads the second word of the sentence, say 𝑥<2>, it does not predict 𝑦^<2> from 𝑥<2> alone; it also takes in some information from time step 1. Specifically, the activation value of time step 1 is passed to time step 2. Then, at the next time step, the recurrent neural network takes the word 𝑥<3> as input and tries to predict 𝑦^<3>, and so on, until the last time step, where it takes 𝑥<𝑇𝑥> as input and outputs 𝑦^<𝑇𝑦>. In this example, at least, 𝑇𝑥 = 𝑇𝑦; if 𝑇𝑥 and 𝑇𝑦 were unequal, this structure would need to change. So at every time step, the recurrent neural network passes an activation value on to the next time step's computation.
You need to construct an activation value for time zero, 𝑎<0>, which is usually a zero vector. Some researchers initialize 𝑎<0> randomly or by other methods, but a zero vector as the pseudo-activation at time zero is the most common choice, so that is what we feed into the network.
At every time step, you input 𝑥<𝑡> and output 𝑦^<𝑡>. To represent the recurrent connection, people sometimes draw a circle, meaning the output is fed back into the network layer, and sometimes a black square, indicating a delay of one time step at that point (as in the diagram on the far right of the original figure).

A recurrent neural network scans the data from left to right, and the parameters are shared across time steps. We use 𝑊ax to denote the parameters governing the connection from 𝑥<1> to the hidden layer; every time step uses the same parameters 𝑊ax. The horizontal connections between activations are governed by the parameters 𝑊aa, and every time step uses the same 𝑊aa; likewise the outputs are determined by 𝑊ya. In this recurrent neural network, this means that **when predicting 𝑦<3>, it uses not only the information from 𝑥<3> but also the information from 𝑥<1> and 𝑥<2>**, because information from 𝑥<1> can flow along this chain of activations to help predict 𝑦<3>.

A weakness of this network is that it uses only the earlier information in the sequence to make a prediction. In particular, when predicting 𝑦^<3>, it does not use the information from 𝑥<4>, 𝑥<5>, 𝑥<6>, and so on. This is a problem: given the sentence "Teddy Roosevelt was a great President.", knowing only the first two words is not enough to judge whether Teddy is part of a person's name; information from later in the sentence is also very useful, because the sentence could instead be "Teddy bears are on sale!". Given only the first three words, it is impossible to know for sure whether Teddy is part of a person's name: in the first example it is, in the second it is not, so you cannot tell the difference from the first three words alone. The solution is the bidirectional recurrent neural network.

Forward propagation in a recurrent neural network

Here is a cleaned-up schematic of the network. As mentioned before, you typically start with 𝑎<0>, a zero vector. Then comes the forward propagation: first compute the activation 𝑎<1>, then compute 𝑦^<1>: 𝑎<1> = 𝑔1(𝑊𝑎𝑎 𝑎<0> + 𝑊𝑎𝑥 𝑥<1> + 𝑏𝑎), 𝑦^<1> = 𝑔2(𝑊𝑦𝑎 𝑎<1> + 𝑏𝑦). The notation convention for the matrix subscripts is this: in 𝑊ax, the second subscript means that 𝑊ax is multiplied by some 𝑥-type quantity, and the first subscript 𝑎 means it is used to compute some 𝑎-type quantity. Similarly, 𝑊ya is multiplied by some 𝑎-type quantity and used to compute a 𝑦^-type quantity.
The activation function used in recurrent neural networks is often tanh; ReLU is sometimes used, but tanh is the more common choice. For the output, if it is a binary classification problem you would use the sigmoid function as the activation, and if it is a 𝑘-class classification problem you could choose softmax. The type of output activation depends on the type of output 𝑦: for named entity recognition, 𝑦 can only be 0 or 1, so the second activation function 𝑔2 here would be sigmoid.
More generally, at time step 𝑡:
𝑎<𝑡> = 𝑔1(𝑊𝑎𝑎 𝑎<𝑡−1> + 𝑊𝑎𝑥 𝑥<𝑡> + 𝑏𝑎)
𝑦^<𝑡> = 𝑔2(𝑊𝑦𝑎 𝑎<𝑡> + 𝑏𝑦)
These equations define the forward propagation of the network: you start from the zero vector 𝑎<0>, then use 𝑎<0> and 𝑥<1> to compute 𝑎<1> and 𝑦^<1>, then use 𝑥<2> and 𝑎<1> to compute 𝑎<2> and 𝑦^<2>, and so on, completing forward propagation from left to right as in the figure.
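Here is a minimal NumPy sketch of this forward pass. The function name rnn_forward, the shapes, and the tanh/sigmoid pairing follow the discussion above but are otherwise assumptions for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, Waa, Wax, Wya, ba, by):
    """Run the forward pass over a sequence xs = [x<1>, ..., x<Tx>] of
    column vectors. The same parameters are reused at every time step."""
    a = np.zeros((Waa.shape[0], 1))           # a<0>: the usual zero-vector start
    activations, predictions = [], []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)   # a<t> = g1(Waa a<t-1> + Wax x<t> + ba)
        y_hat = sigmoid(Wya @ a + by)         # y^<t> = g2(Wya a<t> + by), binary output
        activations.append(a)
        predictions.append(y_hat)
    return activations, predictions
```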
Next, to simplify the notation, I will take the part 𝑊𝑎𝑎 𝑎<𝑡−1> + 𝑊𝑎𝑥 𝑥<𝑡> and write it in a simpler form: 𝑎<𝑡> = 𝑔(𝑊𝑎[𝑎<𝑡−1>, 𝑥<𝑡>] + 𝑏𝑎), where the two sides are equal. We define 𝑊𝑎 by placing the matrices 𝑊𝑎𝑎 and 𝑊𝑎𝑥 side by side horizontally: [𝑊𝑎𝑎 ⋮ 𝑊𝑎𝑥] = 𝑊𝑎. For example, if 𝑎 is 100-dimensional and, continuing the earlier example, 𝑥 is 10,000-dimensional, then 𝑊𝑎𝑎 is a (100, 100) matrix and 𝑊𝑎𝑥 is a (100, 10,000) matrix; stacking the two matrices gives a 𝑊𝑎 that is a (100, 10,100) matrix.
The notation [𝑎<𝑡−1>, 𝑥<𝑡>] means stacking the two vectors on top of each other.
You can check for yourself that multiplying this matrix by this vector recovers the original quantity: the matrix [𝑊𝑎𝑎 ⋮ 𝑊𝑎𝑥] times [𝑎<𝑡−1>; 𝑥<𝑡>] equals exactly 𝑊𝑎𝑎 𝑎<𝑡−1> + 𝑊𝑎𝑥 𝑥<𝑡>, matching the earlier expression, as the check below confirms.
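A quick numerical check of this equivalence (dimensions taken from the example above; the random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 100, 10_000                     # dimensions from the example above
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))
x = rng.standard_normal((n_x, 1))

Wa = np.hstack([Waa, Wax])                 # [Waa | Wax], shape (100, 10100)
ax = np.vstack([a_prev, x])                # [a<t-1>; x<t>], shape (10100, 1)

assert np.allclose(Wa @ ax, Waa @ a_prev + Wax @ x)   # both forms agree
```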
Similarly, for 𝑦^<𝑡> = 𝑔(𝑊𝑦𝑎 𝑎<𝑡> + 𝑏𝑦), I will rewrite it in a simpler way: 𝑦^<𝑡> = 𝑔(𝑊𝑦 𝑎<𝑡> + 𝑏𝑦). Now 𝑊𝑦 and 𝑏𝑦 carry only one subscript, indicating what type of quantity the computation outputs: 𝑊𝑦 is a weight matrix for computing a 𝑦-type quantity, while 𝑊𝑎 and 𝑏𝑎 above are parameters for computing 𝑎-type quantities, i.e., activations.

Backpropagation in a recurrent neural network

Let us first review the forward propagation. You have an input sequence 𝑥<1>, 𝑥<2>, 𝑥<3>, …, 𝑥<𝑇𝑥>. You use 𝑥<1> together with 𝑎<0> to compute the activation of time step 1, then use 𝑥<2> and 𝑎<1> to compute 𝑎<2>, then 𝑎<3>, and so on, up to 𝑎<𝑇𝑥>.
To actually compute 𝑎<1>, you also need the parameters 𝑊𝑎 and 𝑏𝑎. These same parameters are used at every subsequent time step, so we keep using them to compute 𝑎<2>, 𝑎<3>, and so on; all of these activations depend on the parameters 𝑊𝑎 and 𝑏𝑎. Given 𝑎<1>, the network can compute the first prediction 𝑦^<1>, then at the next time step 𝑦^<2>, 𝑦^<3>, and so on, up to 𝑦^<𝑇𝑦>. Computing 𝑦^ requires the parameters 𝑊𝑦 and 𝑏𝑦, which are likewise used at every one of these nodes.
Then, for computing backpropagation, you need one more thing: a loss function. Let us define an element-wise loss function:
๐ฟ<๐‘ก>(๐‘ฆ^<๐‘ก> , ๐‘ฆ<๐‘ก>) = โˆ’๐‘ฆ<๐‘ก>log ๐‘ฆ^<๐‘ก> โˆ’ (1 โˆ’ ๐‘ฆ^<๐‘ก>)๐‘™๐‘œ๐‘”(1 โˆ’ ๐‘ฆ^<๐‘ก>)
This corresponds to a specific word in the sequence: if that word is part of a person's name, then 𝑦<𝑡> is 1, and the network outputs the probability that the word is part of a name, say 0.1. This is the standard logistic regression loss, also called the cross-entropy loss. It is the loss for the prediction of a single word at a single position, i.e., time step 𝑡.
Now let us define the loss for the entire sequence, 𝐿, as
𝐿(𝑦^, 𝑦) = ∑_{𝑡=1}^{𝑇𝑥} 𝐿<𝑡>(𝑦^<𝑡>, 𝑦<𝑡>)
In this diagram, from 𝑦^<1> you can compute the corresponding loss; so we compute the loss of the first time step, then the loss of the second time step, then the third, and so on until the last time step. Finally, to compute the total loss, we add them all up to get the final 𝐿, as the sketch below shows.
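A small sketch of that computation, reusing the predictions returned by the rnn_forward sketch above (binary 0/1 labels assumed; the epsilon clamp is my addition for numerical safety):

```python
import numpy as np

def element_loss(y_hat, y, eps=1e-12):
    """Cross-entropy loss L<t> for a single time step (y is 0 or 1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # keep log() away from 0
    return (-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)).item()

def sequence_loss(predictions, labels):
    """Total loss L: the sum of the per-time-step losses."""
    return sum(element_loss(y_hat, y) for y_hat, y in zip(predictions, labels))
```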
The backpropagation algorithm computes and passes information in the opposite direction. Effectively, you reverse all the forward arrows; after that you can compute all the required quantities, take derivatives with respect to the parameters, and update the parameters by gradient descent. Because the computation runs backwards through time, from right to left, this is known as backpropagation through time.
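As a rough sketch of what reversing the arrows looks like in code, here is a hand-derived backpropagation-through-time pass for the tanh/sigmoid network sketched earlier; it is an illustration under those assumptions, not a reference implementation.

```python
import numpy as np

def rnn_backward(xs, ys, activations, predictions, Waa, Wax, Wya):
    """Backpropagation through time for the tanh/sigmoid RNN sketched above.
    Returns gradients of the summed cross-entropy loss w.r.t. each parameter."""
    n_a = Waa.shape[0]
    dWaa, dWax, dWya = np.zeros_like(Waa), np.zeros_like(Wax), np.zeros_like(Wya)
    dba = np.zeros((n_a, 1))
    dby = np.zeros_like(predictions[0])
    da_next = np.zeros((n_a, 1))              # gradient arriving from step t+1

    for t in reversed(range(len(xs))):        # walk the forward arrows backwards
        a = activations[t]
        a_prev = activations[t - 1] if t > 0 else np.zeros((n_a, 1))
        dz_y = predictions[t] - ys[t]         # sigmoid + cross-entropy derivative
        dWya += dz_y @ a.T
        dby += dz_y
        da = Wya.T @ dz_y + da_next           # from the output and from step t+1
        dz_a = (1.0 - a ** 2) * da            # tanh'(z) = 1 - tanh(z)^2
        dWaa += dz_a @ a_prev.T
        dWax += dz_a @ xs[t].T
        dba += dz_a
        da_next = Waa.T @ dz_a                # pass the gradient back to step t-1
    return dWaa, dWax, dWya, dba, dby
```

A gradient descent step then moves each parameter against its gradient, for example `Waa -= learning_rate * dWaa`.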

ๅŽŸ็ฝ‘็ซ™

็‰ˆๆƒๅฃฐๆ˜Ž
ๆœฌๆ–‡ไธบ[Hair will grow again without it]ๆ‰€ๅˆ›๏ผŒ่ฝฌ่ฝฝ่ฏทๅธฆไธŠๅŽŸๆ–‡้“พๆŽฅ๏ผŒๆ„Ÿ่ฐข
https://yzsam.com/2022/181/202206300722174353.html