Deep learning -- recurrent neural network
2022-06-30 07:44:00 [Hair will grow again without it]
Recurrent neural networks
Why introduce recurrent neural networks?
How do we build a model? We want a neural network that learns a mapping from x to y. One approach to try is a standard neural network. In our earlier example, we had 9 input words. Imagine taking those 9 input words, perhaps as 9 one-hot vectors, and feeding them into a standard neural network; after some hidden layers, it would eventually output 9 values, each 0 or 1, indicating whether each input word is part of a person's name.
But it turns out this approach does not work well, for two main reasons:
- Inputs and outputs can have different lengths in different examples; not every example has the same input length Tx or the same output length Ty. Even if every sentence had some maximum length, you could pad (zero-pad) every input sentence up to that maximum, but this still does not seem like a good representation.
- A standard network does not share features learned at different positions in the text. Concretely, if the network has learned that Harry appearing at position 1 is likely to be part of a person's name, it would be great if it automatically recognized Harry as part of a name when it appears elsewhere, say at x<t>.
As we mentioned before (shown in Figure 1), each of x<1>, ..., x<t>, ..., x<Tx> is a 10,000-dimensional one-hot vector, so the input layer would be enormous. If the total input size is the maximum number of words times 10,000, the weight matrix of the first layer would have a huge number of parameters: with 9 words, that is 90,000 inputs, so even a first hidden layer of 1,000 units would already require 90 million weights.
What is a recurrent neural network?
If you read the sentence from left to right, the first word is x<1>. What we do is feed this first word into a neural network layer, the first hidden layer of the network, and have the network try to predict an output: is this word part of a person's name? What a recurrent neural network does is this: when it reads the second word of the sentence, say x<2>, instead of predicting ŷ<2> using only x<2>, it also takes in some information from time step 1. Specifically, the activation value from time step 1 is passed on to time step 2. Then, at the next time step, the recurrent neural network takes the word x<3> as input and tries to predict ŷ<3>, and so on, until the last time step, where it takes x<Tx> as input and outputs ŷ<Ty>. In this example, at least, Tx = Ty; if Tx and Ty differed, this architecture would need to change. So at every time step, the recurrent neural network passes an activation value on to the next time step's computation.
You also need to make up an activation value for time zero, a<0>; this is usually a vector of zeros. Some researchers initialize a<0> randomly in other ways, but using a zero vector as the pseudo-activation at time zero is the most common choice, so that is what we feed into the network.
At every time step, you input x<t> and output ŷ<t>. To represent the recurrent connection, people sometimes draw a loop back into the layer, meaning the output is fed back into the network layer; sometimes they draw a black square, indicating a delay of one time step at that point (as in the recurrent network on the far right of the figure above).
The recurrent neural network scans the data from left to right, and the parameters are shared across time steps. We use Wax to denote the parameters governing the connection from x<1> to the hidden layer; every time step uses the same parameters Wax. The horizontal connections between activations are governed by the parameters Waa, and every time step uses the same Waa; likewise, the outputs are governed by Wya. In this recurrent network, this means that when predicting ŷ<3>, the network uses not only the information in x<3> but also the information in x<1> and x<2>, because information from x<1> can help predict ŷ<3> along a path like the green one in the figure.
One weakness of this network is that it only uses the earlier information in the sequence to make predictions. In particular, when predicting ŷ<3>, it does not use x<4>, x<5>, x<6>, and so on. That is a problem, because given the sentence "Teddy Roosevelt was a great President.", knowing just the first two words is not enough to decide whether Teddy is part of a person's name; information later in the sentence is also very useful, since the sentence could instead be "Teddy bears are on sale!". Given only the first three words, it is impossible to know for sure whether Teddy is part of a person's name: in the first example it is, in the second it is not, and the first three words alone cannot tell the two apart. The solution is the bidirectional recurrent neural network.
Forward propagation in a recurrent neural network
Here is a cleaned-up schematic of the network. As mentioned before, you typically start with a<0>, a vector of zeros. Then comes forward propagation: first compute the activation a<1>, then compute ŷ<1>:

$a^{<1>} = g_1(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a)$

$\hat{y}^{<1>} = g_2(W_{ya} a^{<1>} + b_y)$
I will use the following convention for the matrix subscripts. In Wax, for example, the second subscript x means that Wax multiplies some x-type quantity, and the first subscript a means it is used to compute some a-type quantity. Likewise, Wya multiplies some a-type quantity and is used to compute a ŷ-type quantity.
The activation function used in recurrent neural networks is often tanh; ReLU is sometimes used, but tanh is the more common choice. If the output is binary, you would probably use a sigmoid as the output activation function; for a k-way classification problem you could choose softmax instead. The type of output activation depends on what type of output y you have: for named entity recognition, y can only be 0 or 1, so the second activation function g2 here would be a sigmoid.
More generally, at time t:

$a^{<t>} = g_1(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)$

$\hat{y}^{<t>} = g_2(W_{ya} a^{<t>} + b_y)$
These equations define the forward propagation of the network: you start from the zero vector a<0>, use a<0> and x<1> to compute a<1> and ŷ<1>, then use x<2> and a<1> to compute a<2> and ŷ<2>, and so on, completing the left-to-right forward propagation as in the figure.
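As a concrete illustration, here is a minimal NumPy sketch of this forward scan. The function and variable names (rnn_forward, Waa, Wax, Wya, ba, by) follow the notation above but are otherwise my own, the tanh/sigmoid choices match the activation discussion, and the dimensions in the usage example are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, a0, Waa, Wax, Wya, ba, by):
    """Scan the sequence left to right, carrying the activation a<t> forward."""
    a_prev, y_hats = a0, []
    for x_t in xs:                                        # xs: list of (n_x, 1) one-hot columns
        a_prev = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # a<t> = g1(Waa a<t-1> + Wax x<t> + ba)
        y_hats.append(sigmoid(Wya @ a_prev + by))         # y^<t> = g2(Wya a<t> + by)
    return y_hats

# Tiny usage example: n_a = 4 hidden units, n_x = 10 vocabulary words, 3 time steps.
rng = np.random.default_rng(0)
n_a, n_x = 4, 10
Waa, Wax = rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x))
Wya, ba, by = rng.standard_normal((1, n_a)), np.zeros((n_a, 1)), np.zeros((1, 1))
xs = [np.eye(n_x)[:, [i]] for i in (3, 1, 7)]             # three one-hot word vectors
print(rnn_forward(xs, np.zeros((n_a, 1)), Waa, Wax, Wya, ba, by))
```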
Next, to simplify the notation, I am going to take the expression Waa a<t-1> + Wax x<t> and write it in a simpler form:

$a^{<t>} = g(W_a [a^{<t-1>}, x^{<t>}] + b_a)$

where the left and right sides are equal. We define Wa by placing the matrices Waa and Wax side by side horizontally: $W_a = [W_{aa} \mid W_{ax}]$.
For instance, if a is 100-dimensional and, continuing the earlier example, x is 10,000-dimensional, then Waa is a (100, 100) matrix and Wax is a (100, 10,000) matrix. Stacking the two matrices side by side, Wa will be a (100, 10,100) matrix.
The notation [a<t-1>, x<t>] means stacking the two vectors on top of each other, giving a single 10,100-dimensional column vector.
You can check for yourself that multiplying this matrix by this vector gives back exactly the original quantity: $[W_{aa} \mid W_{ax}] \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} = W_{aa} a^{<t-1>} + W_{ax} x^{<t>}$, which matches the earlier expression.
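If you want to verify this numerically, here is a quick NumPy check, a sketch using the dimensions from the example above (the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 100, 10000
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))        # a<t-1>
x_t = rng.standard_normal((n_x, 1))           # x<t>

Wa = np.hstack([Waa, Wax])                    # (100, 10100): [Waa | Wax]
stacked = np.vstack([a_prev, x_t])            # (10100, 1):  [a<t-1>, x<t>]

# The stacked product reproduces the sum of the two separate products.
assert np.allclose(Wa @ stacked, Waa @ a_prev + Wax @ x_t)
```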
Similarly, for this example, I will rewrite $\hat{y}^{<t>} = g(W_{ya} a^{<t>} + b_y)$ in the simpler form $\hat{y}^{<t>} = g(W_y a^{<t>} + b_y)$. Now the symbols Wy and by each carry only one subscript, which indicates what type of quantity they are used to compute: Wy is the weight matrix for computing y-type quantities, while Wa and ba above are the parameters for computing a-type quantities, the activations.
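With the compressed notation, one forward step collapses to two lines. This is a sketch, again assuming g1 = tanh and g2 = sigmoid:

```python
import numpy as np

def rnn_step(a_prev, x_t, Wa, Wy, ba, by):
    a_t = np.tanh(Wa @ np.vstack([a_prev, x_t]) + ba)    # a<t> = g(Wa [a<t-1>, x<t>] + ba)
    y_hat_t = 1.0 / (1.0 + np.exp(-(Wy @ a_t + by)))     # y^<t> = g(Wy a<t> + by)
    return a_t, y_hat_t
```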
Backpropagation in a recurrent neural network
Let's first review the forward computation. You have an input sequence x<1>, x<2>, x<3>, ..., x<Tx>. You use x<1> together with a<0> to compute the activation at time step 1, then use x<2> and a<1> to compute a<2>, then a<3>, and so on, up to a<Tx>.
To actually compute a<1>, you also need the parameters Wa and ba, which are used to work out a<1>. These same parameters are used at every later time step, so we keep using them to compute a<2>, a<3>, and so on; all of these activations depend on the parameters Wa and ba. Given a<1>, the network can compute the first prediction ŷ<1>, then at the next time step ŷ<2>, then ŷ<3>, and so on, up to ŷ<Ty>. To compute ŷ, you need the parameters Wy and by, and they are used at all of these output nodes.
Then, to compute backpropagation, you need a loss function. Let's define the element-wise loss

$\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>} \log \hat{y}^{<t>} - (1 - y^{<t>}) \log(1 - \hat{y}^{<t>})$
It corresponds to a specific word in the sequence: if the word is part of a person's name, then y<t> = 1, and the network outputs the probability that the word is part of a name, say 0.1. This is the standard logistic regression loss, also called the cross-entropy loss. It is the loss for the prediction of a single word at a single position, or time step, t.
Now let's define the loss function for the entire sequence, taking L to be

$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_x} \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$
In the diagram, from ŷ<1> you can compute the corresponding loss term; so we compute the loss for the first time step, then the loss for the second time step, then the third, and so on up to the last time step. Finally, to compute the total loss, we add them all up and obtain the final L.
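In code, the two definitions translate directly. A minimal sketch (the function names are mine; y_hats and ys are assumed to be the per-step predictions and 0/1 labels):

```python
import numpy as np

def element_loss(y_hat_t, y_t):
    # L<t>(y^<t>, y<t>) = -y<t> log y^<t> - (1 - y<t>) log(1 - y^<t>)
    return -y_t * np.log(y_hat_t) - (1 - y_t) * np.log(1 - y_hat_t)

def sequence_loss(y_hats, ys):
    # L(y^, y) = sum over t = 1..Tx of L<t>(y^<t>, y<t>)
    return sum(element_loss(y_hat_t, y_t) for y_hat_t, y_t in zip(y_hats, ys))
```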
The backpropagation algorithm computes and passes information in the opposite direction. Essentially, you reverse all the forward-pointing arrows; you can then compute all the appropriate quantities, take derivatives with respect to the parameters, and update the parameters with gradient descent.
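As an illustration, here is a sketch of this backward pass (often called backpropagation through time) for the simple RNN above, assuming tanh hidden activations and a sigmoid output with the cross-entropy loss, so the output-layer error is simply ŷ<t> - y<t>. The cached forward activations are assumed available, and the names are mine:

```python
import numpy as np

def rnn_backward(xs, ys, a_list, y_hats, Waa, Wax, Wya):
    """a_list[t] holds a<t> for t = 0..Tx (a_list[0] is a<0>)."""
    dWaa, dWax, dWya = np.zeros_like(Waa), np.zeros_like(Wax), np.zeros_like(Wya)
    dba = np.zeros((Waa.shape[0], 1))
    dby = np.zeros((Wya.shape[0], 1))
    da_next = np.zeros((Waa.shape[0], 1))    # gradient flowing back from step t+1

    for t in reversed(range(len(xs))):       # walk the arrows in reverse
        a_t, a_prev = a_list[t + 1], a_list[t]
        dz_y = y_hats[t] - ys[t]             # sigmoid + cross-entropy shortcut
        dWya += dz_y @ a_t.T
        dby += dz_y
        da = Wya.T @ dz_y + da_next          # gradient from the output and from step t+1
        dz_a = (1 - a_t ** 2) * da           # tanh derivative
        dWaa += dz_a @ a_prev.T
        dWax += dz_a @ xs[t].T
        dba += dz_a
        da_next = Waa.T @ dz_a               # pass the gradient on to step t-1
    return dWaa, dWax, dWya, dba, dby
```

A gradient descent update would then be, for example, Waa -= learning_rate * dWaa, and similarly for the other parameters.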
Copyright notice
This article was written by [Hair will grow again without it]. If you repost it, please include the original link, thanks:
https://yzsam.com/2022/181/202206300722174353.html