
Machine Learning Notes - Recurrent Neural Network Cheatsheet

2022-06-12 08:59:00 Sit and watch the clouds rise

One. Overview of RNNs

1. Architecture of a traditional RNN

        Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while keeping a hidden state. They typically look as follows:

        For each time step t, the activation a^{<t>} and the output y^{<t>} are expressed as follows:

        \boxed{a^{<t>}=g_1(W_{aa}a^{<t-1>}+W_{ax}x^{<t>}+b_a)}\quad\textrm{and}\quad\boxed{y^{<t>}=g_2(W_{ya}a^{<t>}+b_y)}

        where W_{ax}, W_{aa}, W_{ya}, b_a, b_y are coefficients that are shared across time and g_1, g_2 are activation functions.
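        As an illustration, the sketch below implements one forward time step of the two formulas above in NumPy (not from the original notes; the helper name rnn_step, the toy dimensions and the choices g_1 = tanh, g_2 = softmax are assumptions):

```python
import numpy as np

def rnn_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """One time step of a traditional RNN: a^<t> = g1(...), y^<t> = g2(...)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)   # g1 chosen as tanh
    scores = W_ya @ a_t + b_y
    y_t = np.exp(scores - scores.max())               # g2 chosen as softmax
    y_t /= y_t.sum()
    return a_t, y_t

# Toy dimensions: hidden size 4, input size 3, output size 2
rng = np.random.default_rng(0)
W_aa, W_ax = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
W_ya, b_a, b_y = rng.normal(size=(2, 4)), np.zeros(4), np.zeros(2)

a = np.zeros(4)                                       # initial hidden state a^<0>
for x in rng.normal(size=(5, 3)):                     # unroll over T_x = 5 time steps
    a, y = rnn_step(x, a, W_aa, W_ax, W_ya, b_a, b_y)
```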

        The table below summarizes the advantages and drawbacks of a typical RNN architecture:

        Advantages: possibility of processing input of any length; model size does not increase with the size of the input; computation takes into account historical information; weights are shared across time.

        Drawbacks: computation is slow; difficulty of accessing information from a long time ago; inability to consider any future input for the current state.

2. Applications of RNNs

        RNN models are mostly used in the fields of natural language processing and speech recognition. The table below sums up the different applications:

Type of RNN (sizes): example

One-to-one (T_x = T_y = 1): traditional neural network
One-to-many (T_x = 1, T_y > 1): music generation
Many-to-one (T_x > 1, T_y = 1): sentiment classification
Many-to-many (T_x = T_y): named entity recognition
Many-to-many (T_x \neq T_y): machine translation

3. Loss function

        In the case of a recurrent neural network, the loss function \mathcal{L} of all time steps is defined based on the loss at every time step as follows:

        \boxed{\mathcal{L}(\widehat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}(\widehat{y}^{<t>},y^{<t>})}
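        A minimal sketch of this summed loss, assuming a cross-entropy loss at each time step and one-hot targets (the array names y_hat and y are illustrative):

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Total loss = sum over time steps of the per-step cross-entropy."""
    # y_hat, y: arrays of shape (T_y, vocab_size); y is one-hot
    per_step = -np.sum(y * np.log(y_hat + eps), axis=1)   # L(y_hat^<t>, y^<t>)
    return float(per_step.sum())                          # sum over t = 1..T_y
```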

4. Backpropagation through time

        Backpropagation is done at each point in time. At time step T, the derivative of the loss \mathcal{L} with respect to the weight matrix W is expressed as follows:

        \boxed{\frac{\partial \mathcal{L}^{(T)}}{\partial W}=\sum_{t=1}^T\left.\frac{\partial\mathcal{L}^{(T)}}{\partial W}\right|_{(t)}}

Two. Dealing with long-term dependencies

1. Commonly used activation functions

        The most commonly used activation functions in RNN modules are the sigmoid, tanh and ReLU functions.

2. Vanishing/exploding gradient

        The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They happen because it is difficult to capture long-term dependencies: the multiplicative gradient can decrease or increase exponentially with respect to the number of layers.

3. Gradient clipping

        Gradient clipping is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value of the gradient, this phenomenon is controlled in practice.
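        A minimal sketch of clipping by global norm, assuming the gradients are stored in a dictionary of NumPy arrays and an arbitrary threshold of 5.0:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for name in grads:
            grads[name] = grads[name] * scale
    return grads
```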

4. Types of gates

        In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted \Gamma and are equal to:

        \boxed{\Gamma=\sigma(Wx^{<t>}+Ua^{<t-1>}+b)}

        where W, U, b are coefficients specific to the gate and \sigma is the sigmoid function. The main ones are summed up in the table below:

Type of gate: role (used in)

Update gate \Gamma_u: how much past should matter now? (GRU, LSTM)
Relevance gate \Gamma_r: drop previous information? (GRU, LSTM)
Forget gate \Gamma_f: erase a cell or not? (LSTM)
Output gate \Gamma_o: how much to reveal of a cell? (LSTM)
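        A minimal sketch of the generic gate formula \Gamma = \sigma(Wx^{<t>} + Ua^{<t-1>} + b) above (the helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x_t, a_prev, W, U, b):
    """Generic gate Gamma = sigmoid(W x^<t> + U a^<t-1> + b), with values in (0, 1)."""
    return sigmoid(W @ x_t + U @ a_prev + b)
```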

5. GRU/LSTM

        Gated Recurrent Units (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. The table below sums up the characterizing equations of each architecture:

        Remark: the sign \star denotes the element-wise multiplication between two vectors.
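        As a sketch of how these gates combine, the code below implements one step of the standard GRU formulation (this particular set of equations is an assumption, since the table of characterizing equations is not reproduced here; the parameter dictionary p and helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, a_prev, p):
    """One GRU step; '*' below is the element-wise product noted by the star symbol."""
    # Update and relevance gates follow Gamma = sigmoid(W x^<t> + U a^<t-1> + b)
    gamma_u = sigmoid(p["W_u"] @ x_t + p["U_u"] @ a_prev + p["b_u"])   # update gate
    gamma_r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ a_prev + p["b_r"])   # relevance gate
    # Candidate activation, computed from the relevance-gated previous state
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ (gamma_r * a_prev) + p["b_c"])
    # Blend candidate and past state according to the update gate
    a_t = gamma_u * c_tilde + (1 - gamma_u) * a_prev
    return a_t
```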

6. Variants of RNNs

        The table below sums up the other commonly used RNN architectures:

Three. Learning word representations

1. Motivation and notations

        Representation techniques: the two main ways of representing words are summed up in the table below:

2. Embedding matrix

        For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation o_w to its embedding e_w as follows:

        \boxed{e_w=Eo_w}

        Remark: learning the embedding matrix can be done using target/context likelihood models.
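        A minimal sketch of the lookup e_w = E o_w with a randomly initialized embedding matrix (in practice E is learned; the dimensions are illustrative):

```python
import numpy as np

vocab_size, embed_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(embed_dim, vocab_size))   # embedding matrix (learned in practice)

w = 1234                                       # index of word w in the vocabulary
o_w = np.zeros(vocab_size)                     # 1-hot representation o_w
o_w[w] = 1.0

e_w = E @ o_w                                  # e_w = E o_w
# In practice the matrix product reduces to a simple column lookup:
assert np.allclose(e_w, E[:, w])
```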

Four. Word embeddings

1. Word2vec

        Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.

2. Skip-gram

        The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t occurring with a context word c. By noting \theta_t a parameter associated with t, the probability P(t|c) is given by:

        \boxed{P(t|c)=\frac{\exp(\theta_t^Te_c)}{\displaystyle\sum_{j=1}^{|V|}\exp(\theta_j^Te_c)}}

        Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model that uses the surrounding words to predict a given word.
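        A minimal sketch of P(t|c) for all target words at once, which makes the costly full-vocabulary sum explicit (the array theta stacks one \theta_t per vocabulary word; names are illustrative):

```python
import numpy as np

def skipgram_probs(e_c, theta):
    """P(t | c) for every target word t, given the context embedding e_c.

    theta has shape (|V|, d): one parameter vector theta_t per vocabulary word.
    """
    scores = theta @ e_c                      # theta_t^T e_c for all t
    scores -= scores.max()                    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()      # softmax over the whole vocabulary
```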

3. Negative sampling

        Negative sampling is a set of binary classifiers using logistic regression that aim at assessing how likely a given context and a given target word are to appear together, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:

        \boxed{P(y=1|c,t)=\sigma(\theta_t^Te_c)}

        Remark: this method is less computationally expensive than the skip-gram model.
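        A minimal sketch of the prediction P(y=1|c,t) above (the training loop over the k negative examples is omitted; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_prob(e_c, theta_t):
    """P(y = 1 | c, t): probability that context c and target t co-occur."""
    return sigmoid(theta_t @ e_c)

# Each training example uses 1 positive (c, t) pair and k sampled negative pairs,
# so only k + 1 binary classifiers are evaluated instead of a |V|-way softmax.
```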

4. GloVe

        The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each X_{i,j} denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:

        \boxed{J(\theta)=\frac{1}{2}\sum_{i,j=1}^{|V|}f(X_{ij})(\theta_i^Te_j+b_i+b_j'-\log(X_{ij}))^2}

        where f is a weighting function such that X_{i,j}=0\Longrightarrow f(X_{i,j})=0.
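        A minimal sketch of this cost function, assuming the commonly used weighting f(x) = min(1, (x/x_max)^\alpha) from the GloVe paper (the parameter values are the usual defaults, not from these notes):

```python
import numpy as np

def glove_cost(X, theta, e, b, b_prime, x_max=100, alpha=0.75):
    """GloVe cost J over the co-occurrence matrix X (a slow but explicit double loop)."""
    J = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:                  # f(0) = 0, so the term drops out
                continue
            f = min(1.0, (X[i, j] / x_max) ** alpha)
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            J += 0.5 * f * diff ** 2
    return J
```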

        Given the symmetric roles that e and \theta play in this model, the final word embedding e_w^{(\textrm{final})} is given by:

        \boxed{e_w^{(\textrm{final})}=\frac{e_w+\theta_w}{2}}

5. Comparing words

(1) Cosine similarity

        The cosine similarity between words w_1 and w_2 is expressed as follows:

        \boxed{\textrm{similarity}=\frac{w_1\cdot w_2}{||w_1||\textrm{ }||w_2||}=\cos(\theta)}

        Remark: \theta is the angle between words w_1 and w_2.
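        A minimal sketch of this similarity measure in NumPy:

```python
import numpy as np

def cosine_similarity(w1, w2):
    """similarity = (w1 . w2) / (||w1|| ||w2||) = cos(theta)."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```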

        

(2) t-SNE

        t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower-dimensional space. In practice, it is commonly used to visualize word vectors in a 2D space.
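        A minimal usage sketch with scikit-learn's TSNE (the random embeddings stand in for real word vectors; the perplexity value is an arbitrary choice):

```python
import numpy as np
from sklearn.manifold import TSNE

# embeddings: one 300-dimensional word vector per row (random stand-in here)
embeddings = np.random.default_rng(0).normal(size=(500, 300))

# Reduce to 2D for visualization; perplexity must be smaller than the number of samples
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords_2d.shape)   # (500, 2)
```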

Five. Language models

        A language model aims at estimating the probability of a sentence P(y).

1. n-gram model

        This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of occurrences in the training data.
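        A minimal sketch of a count-based estimate for the bigram case, i.e. n = 2 (the helper name is illustrative):

```python
from collections import Counter

def bigram_prob(corpus_tokens, w_prev, w):
    """P(w | w_prev) estimated by counting occurrences in the training corpus."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
```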

2. Perplexity

        Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The lower the perplexity, the better; it is defined as follows:

        \boxed{\textrm{PP}=\prod_{t=1}^T\left(\frac{1}{\sum_{j=1}^{|V|}y_j^{(t)}\cdot \widehat{y}_j^{(t)}}\right)^{\frac{1}{T}}}

        Remark: PP is commonly used in t-SNE.
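        A minimal sketch of this metric for a single sequence, computed in log space for numerical stability (array names are illustrative):

```python
import numpy as np

def perplexity(y_hat, y):
    """PP for one sequence; y_hat, y have shape (T, |V|) and y is one-hot."""
    # probability assigned to the correct word at each time step
    p_correct = np.sum(y * y_hat, axis=1)                 # shape (T,)
    T = len(p_correct)
    # prod(1 / p)^(1/T) rewritten as exp(-(1/T) * sum(log p))
    return float(np.exp(-np.sum(np.log(p_correct)) / T))
```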

Six. Machine translation

        A machine translation model is similar to a language model, except that it has an encoder network placed in front. For this reason, it is sometimes referred to as a conditional language model.

        The goal is to find a sentence y such that:

        \boxed{y=\underset{y^{<1>}, ..., y^{<T_y>}}{\textrm{arg max}}P(y^{<1>},...,y^{<T_y>}|x)}

1. Beam search

        Beam search is a heuristic search algorithm used in machine translation and speech recognition to find the most likely sentence y given an input x.

        Step 1: find the B most likely words y^{<1>}

        Step 2: compute the conditional probabilities y^{<k>}|x,y^{<1>},...,y^{<k-1>}

        Step 3: keep the top B combinations x,y^{<1>},...,y^{<k>}

        Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
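        A minimal sketch of these three steps (the model callable next_word_log_probs and the <eos> token are assumptions standing in for a real decoder):

```python
import numpy as np

def beam_search(x, next_word_log_probs, vocab, B=10, max_len=20, eos="<eos>"):
    """Keep the B most likely partial sentences at every decoding step.

    next_word_log_probs(x, prefix) is a hypothetical callable returning an array of
    log P(y^<k> | x, y^<1>, ..., y^<k-1>) over the vocabulary list `vocab`.
    """
    beams = [([], 0.0)]                               # (prefix, total log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:          # finished sentence: keep as-is
                candidates.append((prefix, score))
                continue
            log_p = next_word_log_probs(x, prefix)
            for j in np.argsort(log_p)[-B:]:          # top-B extensions of this prefix
                candidates.append((prefix + [vocab[j]], score + log_p[j]))
        # keep only the B best combinations overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0][0]                                # most likely sentence found
```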

2. Beam width

        The beam width B is a parameter of beam search. Large values of B yield better results but with slower performance and higher memory usage. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.

3. Length normalization

        In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:

                \boxed{\textrm{Objective } = \frac{1}{T_y^\alpha}\sum_{t=1}^{T_y}\log\Big[p(y^{<t>}|x,y^{<1>}, ..., y^{<t-1>})\Big]}

        Remark: the parameter \alpha can be seen as a softener, and its value is usually between 0.5 and 1.
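        A minimal sketch of this normalized objective, given the per-step log probabilities of a candidate sentence (names are illustrative):

```python
import numpy as np

def normalized_objective(log_probs, alpha=0.7):
    """Normalized log-likelihood: (1 / T_y^alpha) * sum_t log p(y^<t> | x, y^<1..t-1>)."""
    T_y = len(log_probs)
    return float(np.sum(log_probs) / (T_y ** alpha))
```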

4. Error analysis

        When obtaining a predicted translation \widehat{y} that is bad, one can wonder why we did not get a good translation y^* by performing the following error analysis:

5. Bleu score

        The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:

                \boxed{\textrm{bleu score}=\exp\left(\frac{1}{n}\sum_{k=1}^np_k\right)}

        where p_n is the bleu score on n-grams only, defined as follows:

                p_n=\frac{\displaystyle\sum_{\textrm{n-gram}\in\widehat{y}}\textrm{count}_{\textrm{clip}}(\textrm{n-gram})}{\displaystyle\sum_{\textrm{n-gram}\in\widehat{y}}\textrm{count}(\textrm{n-gram})}

        Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.
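        A minimal sketch of this score with clipped n-gram counts against a single reference, following the exp-of-average form written above (the brevity penalty from the remark is omitted; helper names are illustrative):

```python
from collections import Counter
import numpy as np

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_score(candidate, reference, n_max=4):
    """bleu = exp((1/n) * sum_k p_k), with clipped n-gram counts and no brevity penalty."""
    precisions = []
    for n in range(1, n_max + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    return float(np.exp(np.mean(precisions)))
```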

Seven. Attention

1. Attention model

        This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. By noting \alpha^{<t,t'>} the amount of attention that the output y^{<t>} should pay to the activation a^{<t'>} and c^{<t>} the context at time t, we have:

        \boxed{c^{<t>}=\sum_{t'}\alpha^{<t, t'>}a^{<t'>}}\quad\textrm{with}\quad\sum_{t'}\alpha^{<t,t'>}=1

        Remark: attention scores are commonly used in image captioning and machine translation.

2. Attention weights

        The amount of attention that the output y^{<t>} should pay to the activation a^{<t'>} is given by \alpha^{<t,t'>}, computed as follows:

        \boxed{\alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\displaystyle\sum_{t''=1}^{T_x}\exp(e^{<t,t''>})}}

        Remark: the computational complexity is quadratic with respect to T_x.
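        A minimal sketch that turns a vector of scores e^{<t,t'>} into the attention weights \alpha^{<t,t'>} and the context c^{<t>} (array names are illustrative):

```python
import numpy as np

def attention_context(scores, a):
    """Context c^<t> = sum_t' alpha^<t,t'> a^<t'>, with alpha = softmax of the scores.

    scores: shape (T_x,) holding e^<t,t'>; a: shape (T_x, hidden_dim).
    """
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    return alpha @ a                               # weighted sum of activations
```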


Copyright notice
This article was created by [Sit and watch the clouds rise]. Please include a link to the original article when reposting. Thanks.
https://yzsam.com/2022/163/202206120853411535.html