
Machine Learning Notes - Recurrent Neural Network Cheatsheet

2022-06-12 08:59:00 Sit and watch the clouds rise

One. Overview of RNNs

1. Architecture of a traditional RNN

        Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while keeping a hidden state. They typically look as follows:

        For each time step t, the activation a^{<t>} and the output y^{<t>} are expressed as follows:

        \boxed{a^{<t>}=g_1(W_{aa}a^{<t-1>}+W_{ax}x^{<t>}+b_a)}\quad\textrm{and}\quad\boxed{y^{<t>}=g_2(W_{ya}a^{<t>}+b_y)}

        where W_{ax}, W_{aa}, W_{ya}, b_a, b_y are coefficients that are shared across time and g_1, g_2 are activation functions.
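        As an illustration, the sketch below implements one forward time step of the two formulas above in NumPy (not from the original notes; the helper name rnn_step, the toy dimensions and the choices g_1 = tanh, g_2 = softmax are assumptions):

```python
import numpy as np

def rnn_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """One time step of a traditional RNN: a^<t> = g1(...), y^<t> = g2(...)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)   # g1 chosen as tanh
    scores = W_ya @ a_t + b_y
    y_t = np.exp(scores - scores.max())               # g2 chosen as softmax
    y_t /= y_t.sum()
    return a_t, y_t

# Toy dimensions: hidden size 4, input size 3, output size 2
rng = np.random.default_rng(0)
W_aa, W_ax = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
W_ya, b_a, b_y = rng.normal(size=(2, 4)), np.zeros(4), np.zeros(2)

a = np.zeros(4)                                       # initial hidden state a^<0>
for x in rng.normal(size=(5, 3)):                     # unroll over T_x = 5 time steps
    a, y = rnn_step(x, a, W_aa, W_ax, W_ya, b_a, b_y)
```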

        The table below summarizes the advantages and drawbacks of a typical RNN architecture:

        Advantages: possibility of processing input of any length; model size does not increase with the size of the input; computation takes into account historical information; weights are shared across time.

        Drawbacks: computation is slow; difficulty of accessing information from a long time ago; inability to consider any future input for the current state.

2. Applications of RNNs

        RNN models are mostly used in the fields of natural language processing and speech recognition. The table below sums up the different applications:

Type of RNN (sizes): example

One-to-one (T_x = T_y = 1): traditional neural network
One-to-many (T_x = 1, T_y > 1): music generation
Many-to-one (T_x > 1, T_y = 1): sentiment classification
Many-to-many (T_x = T_y): named entity recognition
Many-to-many (T_x \neq T_y): machine translation

3. Loss function

        In the case of a recurrent neural network, the loss function \mathcal{L} of all time steps is defined based on the loss at every time step as follows:

        \boxed{\mathcal{L}(\widehat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}(\widehat{y}^{<t>},y^{<t>})}
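        A minimal sketch of this summed loss, assuming a cross-entropy loss at each time step and one-hot targets (the array names y_hat and y are illustrative):

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Total loss = sum over time steps of the per-step cross-entropy."""
    # y_hat, y: arrays of shape (T_y, vocab_size); y is one-hot
    per_step = -np.sum(y * np.log(y_hat + eps), axis=1)   # L(y_hat^<t>, y^<t>)
    return float(per_step.sum())                          # sum over t = 1..T_y
```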

4. Backpropagation through time

        Backpropagation is done at each point in time. At time step T, the derivative of the loss \mathcal{L} with respect to the weight matrix W is expressed as follows:

        \boxed{\frac{\partial \mathcal{L}^{(T)}}{\partial W}=\sum_{t=1}^T\left.\frac{\partial\mathcal{L}^{(T)}}{\partial W}\right|_{(t)}}

Two. Dealing with long-term dependencies

1. Commonly used activation functions

        The most commonly used activation functions in RNN modules are the sigmoid, tanh and ReLU functions.

2. Vanishing/exploding gradient

        The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They happen because it is difficult to capture long-term dependencies: the multiplicative gradient can decrease or increase exponentially with respect to the number of layers.

3. Gradient clipping

        Gradient clipping is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value of the gradient, this phenomenon is controlled in practice.
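        A minimal sketch of clipping by global norm, assuming the gradients are stored in a dictionary of NumPy arrays and an arbitrary threshold of 5.0:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for name in grads:
            grads[name] = grads[name] * scale
    return grads
```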

4. Types of gates

        In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted \Gamma and are equal to:

        \boxed{\Gamma=\sigma(Wx^{<t>}+Ua^{<t-1>}+b)}

        where W, U, b are coefficients specific to the gate and \sigma is the sigmoid function. The main ones are summed up in the table below:

Type of gate: role (used in)

Update gate \Gamma_u: how much past should matter now? (GRU, LSTM)
Relevance gate \Gamma_r: drop previous information? (GRU, LSTM)
Forget gate \Gamma_f: erase a cell or not? (LSTM)
Output gate \Gamma_o: how much to reveal of a cell? (LSTM)
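        A minimal sketch of the generic gate formula \Gamma = \sigma(Wx^{<t>} + Ua^{<t-1>} + b) above (the helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x_t, a_prev, W, U, b):
    """Generic gate Gamma = sigmoid(W x^<t> + U a^<t-1> + b), with values in (0, 1)."""
    return sigmoid(W @ x_t + U @ a_prev + b)
```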

5. GRU/LSTM

        Gated Recurrent Units (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. The table below sums up the characterizing equations of each architecture:

        Remark: the sign \star denotes the element-wise multiplication between two vectors.
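        As a sketch of how these gates combine, the code below implements one step of the standard GRU formulation (this particular set of equations is an assumption, since the table of characterizing equations is not reproduced here; the parameter dictionary p and helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, a_prev, p):
    """One GRU step; '*' below is the element-wise product noted by the star symbol."""
    # Update and relevance gates follow Gamma = sigmoid(W x^<t> + U a^<t-1> + b)
    gamma_u = sigmoid(p["W_u"] @ x_t + p["U_u"] @ a_prev + p["b_u"])   # update gate
    gamma_r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ a_prev + p["b_r"])   # relevance gate
    # Candidate activation, computed from the relevance-gated previous state
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ (gamma_r * a_prev) + p["b_c"])
    # Blend candidate and past state according to the update gate
    a_t = gamma_u * c_tilde + (1 - gamma_u) * a_prev
    return a_t
```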

6. Variants of RNNs

        The table below sums up the other commonly used RNN architectures:

Three. Learning word representations

1. Motivation and notations

        Representation techniques: the two main ways of representing words are summed up in the table below:

2. Embedding matrix

        For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation o_w to its embedding e_w as follows:

        \boxed{e_w=Eo_w}

        Remark: learning the embedding matrix can be done using target/context likelihood models.
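        A minimal sketch of the lookup e_w = E o_w with a randomly initialized embedding matrix (in practice E is learned; the dimensions are illustrative):

```python
import numpy as np

vocab_size, embed_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(embed_dim, vocab_size))   # embedding matrix (learned in practice)

w = 1234                                       # index of word w in the vocabulary
o_w = np.zeros(vocab_size)                     # 1-hot representation o_w
o_w[w] = 1.0

e_w = E @ o_w                                  # e_w = E o_w
# In practice the matrix product reduces to a simple column lookup:
assert np.allclose(e_w, E[:, w])
```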

Four. Word embeddings

1. Word2vec

        Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.

2. Skip-gram

        The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t occurring with a context word c. By noting \theta_t a parameter associated with t, the probability P(t|c) is given by:

        \boxed{P(t|c)=\frac{\exp(\theta_t^Te_c)}{\displaystyle\sum_{j=1}^{|V|}\exp(\theta_j^Te_c)}}

        Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model that uses the surrounding words to predict a given word.
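        A minimal sketch of P(t|c) for all target words at once, which makes the costly full-vocabulary sum explicit (the array theta stacks one \theta_t per vocabulary word; names are illustrative):

```python
import numpy as np

def skipgram_probs(e_c, theta):
    """P(t | c) for every target word t, given the context embedding e_c.

    theta has shape (|V|, d): one parameter vector theta_t per vocabulary word.
    """
    scores = theta @ e_c                      # theta_t^T e_c for all t
    scores -= scores.max()                    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()      # softmax over the whole vocabulary
```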

3. Negative sampling

        Negative sampling is a set of binary classifiers using logistic regression that aim at assessing how likely a given context and a given target word are to appear together, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:

        \boxed{P(y=1|c,t)=\sigma(\theta_t^Te_c)}

        Remark: this method is less computationally expensive than the skip-gram model.
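        A minimal sketch of the prediction P(y=1|c,t) above (the training loop over the k negative examples is omitted; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_prob(e_c, theta_t):
    """P(y = 1 | c, t): probability that context c and target t co-occur."""
    return sigmoid(theta_t @ e_c)

# Each training example uses 1 positive (c, t) pair and k sampled negative pairs,
# so only k + 1 binary classifiers are evaluated instead of a |V|-way softmax.
```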

4. GloVe

        The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each X_{i,j} denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:

        \boxed{J(\theta)=\frac{1}{2}\sum_{i,j=1}^{|V|}f(X_{ij})(\theta_i^Te_j+b_i+b_j'-\log(X_{ij}))^2}

        where f is a weighting function such that X_{i,j}=0\Longrightarrow f(X_{i,j})=0.
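        A minimal sketch of this cost function, assuming the commonly used weighting f(x) = min(1, (x/x_max)^\alpha) from the GloVe paper (the parameter values are the usual defaults, not from these notes):

```python
import numpy as np

def glove_cost(X, theta, e, b, b_prime, x_max=100, alpha=0.75):
    """GloVe cost J over the co-occurrence matrix X (a slow but explicit double loop)."""
    J = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:                  # f(0) = 0, so the term drops out
                continue
            f = min(1.0, (X[i, j] / x_max) ** alpha)
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            J += 0.5 * f * diff ** 2
    return J
```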

        Given the symmetric roles that e and \theta play in this model, the final word embedding e_w^{(\textrm{final})} is given by:

        \boxed{e_w^{(\textrm{final})}=\frac{e_w+\theta_w}{2}}

5. Comparing words

(1) Cosine similarity

        The cosine similarity between words w_1 and w_2 is expressed as follows:

        \boxed{\textrm{similarity}=\frac{w_1\cdot w_2}{||w_1||\textrm{ }||w_2||}=\cos(\theta)}

        Remark: \theta is the angle between words w_1 and w_2.
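        A minimal sketch of this similarity measure in NumPy:

```python
import numpy as np

def cosine_similarity(w1, w2):
    """similarity = (w1 . w2) / (||w1|| ||w2||) = cos(theta)."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```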

        

(2) t-SNE

        t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower-dimensional space. In practice, it is commonly used to visualize word vectors in a 2D space.
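        A minimal usage sketch with scikit-learn's TSNE (the random embeddings stand in for real word vectors; the perplexity value is an arbitrary choice):

```python
import numpy as np
from sklearn.manifold import TSNE

# embeddings: one 300-dimensional word vector per row (random stand-in here)
embeddings = np.random.default_rng(0).normal(size=(500, 300))

# Reduce to 2D for visualization; perplexity must be smaller than the number of samples
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords_2d.shape)   # (500, 2)
```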

Five. Language models

        A language model aims at estimating the probability of a sentence P(y).

1. n-gram model

        This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of occurrences in the training data.
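        A minimal sketch of a count-based estimate for the bigram case, i.e. n = 2 (the helper name is illustrative):

```python
from collections import Counter

def bigram_prob(corpus_tokens, w_prev, w):
    """P(w | w_prev) estimated by counting occurrences in the training corpus."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
```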

2. Perplexity

        Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The lower the perplexity, the better; it is defined as follows:

        \boxed{\textrm{PP}=\prod_{t=1}^T\left(\frac{1}{\sum_{j=1}^{|V|}y_j^{(t)}\cdot \widehat{y}_j^{(t)}}\right)^{\frac{1}{T}}}

        Remark: PP is commonly used in t-SNE.
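        A minimal sketch of this metric for a single sequence, computed in log space for numerical stability (array names are illustrative):

```python
import numpy as np

def perplexity(y_hat, y):
    """PP for one sequence; y_hat, y have shape (T, |V|) and y is one-hot."""
    # probability assigned to the correct word at each time step
    p_correct = np.sum(y * y_hat, axis=1)                 # shape (T,)
    T = len(p_correct)
    # prod(1 / p)^(1/T) rewritten as exp(-(1/T) * sum(log p))
    return float(np.exp(-np.sum(np.log(p_correct)) / T))
```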

Six. Machine translation

        A machine translation model is similar to a language model, except that it has an encoder network placed in front. For this reason, it is sometimes referred to as a conditional language model.

        The goal is to find a sentence y such that:

        \boxed{y=\underset{y^{<1>}, ..., y^{<T_y>}}{\textrm{arg max}}P(y^{<1>},...,y^{<T_y>}|x)}

1. Beam search

        Beam search is a heuristic search algorithm used in machine translation and speech recognition to find the most likely sentence y given an input x.

        Step 1: find the B most likely words y^{<1>}

        Step 2: compute the conditional probabilities y^{<k>}|x,y^{<1>},...,y^{<k-1>}

        Step 3: keep the top B combinations x,y^{<1>},...,y^{<k>}

        Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
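        A minimal sketch of these three steps (the model callable next_word_log_probs and the <eos> token are assumptions standing in for a real decoder):

```python
import numpy as np

def beam_search(x, next_word_log_probs, vocab, B=10, max_len=20, eos="<eos>"):
    """Keep the B most likely partial sentences at every decoding step.

    next_word_log_probs(x, prefix) is a hypothetical callable returning an array of
    log P(y^<k> | x, y^<1>, ..., y^<k-1>) over the vocabulary list `vocab`.
    """
    beams = [([], 0.0)]                               # (prefix, total log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:          # finished sentence: keep as-is
                candidates.append((prefix, score))
                continue
            log_p = next_word_log_probs(x, prefix)
            for j in np.argsort(log_p)[-B:]:          # top-B extensions of this prefix
                candidates.append((prefix + [vocab[j]], score + log_p[j]))
        # keep only the B best combinations overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0][0]                                # most likely sentence found
```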

2. Beam width

        The beam width B is a parameter of beam search. Large values of B yield better results but with slower performance and higher memory usage. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.

3. Length normalization

        In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:

                \boxed{\textrm{Objective } = \frac{1}{T_y^\alpha}\sum_{t=1}^{T_y}\log\Big[p(y^{<t>}|x,y^{<1>}, ..., y^{<t-1>})\Big]}

        Remark: the parameter \alpha can be seen as a softener, and its value is usually between 0.5 and 1.
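        A minimal sketch of this normalized objective, given the per-step log probabilities of a candidate sentence (names are illustrative):

```python
import numpy as np

def normalized_objective(log_probs, alpha=0.7):
    """Normalized log-likelihood: (1 / T_y^alpha) * sum_t log p(y^<t> | x, y^<1..t-1>)."""
    T_y = len(log_probs)
    return float(np.sum(log_probs) / (T_y ** alpha))
```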

4. Error analysis

        When obtaining a predicted translation \widehat{y} that is bad, one can wonder why we did not get a good translation y^* by performing the following error analysis:

5. Bleu score

        The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:

                \boxed{\textrm{bleu score}=\exp\left(\frac{1}{n}\sum_{k=1}^np_k\right)}

        where p_n is the bleu score on n-grams only, defined as follows:

                p_n=\frac{\displaystyle\sum_{\textrm{n-gram}\in\widehat{y}}\textrm{count}_{\textrm{clip}}(\textrm{n-gram})}{\displaystyle\sum_{\textrm{n-gram}\in\widehat{y}}\textrm{count}(\textrm{n-gram})}

        Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.
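        A minimal sketch of this score with clipped n-gram counts against a single reference, following the exp-of-average form written above (the brevity penalty from the remark is omitted; helper names are illustrative):

```python
from collections import Counter
import numpy as np

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_score(candidate, reference, n_max=4):
    """bleu = exp((1/n) * sum_k p_k), with clipped n-gram counts and no brevity penalty."""
    precisions = []
    for n in range(1, n_max + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    return float(np.exp(np.mean(precisions)))
```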

Seven. Attention

1. Attention model

        This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. By noting \alpha^{<t,t'>} the amount of attention that the output y^{<t>} should pay to the activation a^{<t'>} and c^{<t>} the context at time t, we have:

        \boxed{c^{<t>}=\sum_{t'}\alpha^{<t, t'>}a^{<t'>}}\quad\textrm{with}\quad\sum_{t'}\alpha^{<t,t'>}=1

        Remark: attention scores are commonly used in image captioning and machine translation.

2. Attention weights

        The amount of attention that the output y^{<t>} should pay to the activation a^{<t'>} is given by \alpha^{<t,t'>}, computed as follows:

        \boxed{\alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\displaystyle\sum_{t''=1}^{T_x}\exp(e^{<t,t''>})}}

        Remark: the computational complexity is quadratic with respect to T_x.
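        A minimal sketch that turns a vector of scores e^{<t,t'>} into the attention weights \alpha^{<t,t'>} and the context c^{<t>} (array names are illustrative):

```python
import numpy as np

def attention_context(scores, a):
    """Context c^<t> = sum_t' alpha^<t,t'> a^<t'>, with alpha = softmax of the scores.

    scores: shape (T_x,) holding e^<t,t'>; a: shape (T_x, hidden_dim).
    """
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    return alpha @ a                               # weighted sum of activations
```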


Copyright notice
This article was created by [Sit and watch the clouds rise]. Please include a link to the original article when reposting. Thanks.
https://yzsam.com/2022/163/202206120853411535.html