
2022 Tsinghua summer school notes L2_1: Basic composition of a neural network

2022-07-24 21:33:00 【The duck neck is gone】

2022 Tsinghua University Large Model Interdisciplinary Seminar

L2 Neural Network basics

1 The basic composition of neural network

1.1 Neuron

A single neuron:

Take the dot product of the weight vector (matrix) and the input vector to get a scalar value, add the bias b (a scalar), and feed the result into the nonlinear activation function f to get the output.
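
The computation above can be sketched in a few lines of NumPy. The sigmoid activation and the numeric values are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, used here only as an example activation f
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, x, b, f=sigmoid):
    # Dot product of weights and input gives a scalar; add the bias, then apply f
    return f(np.dot(w, x) + b)

w = np.array([0.5, -1.0, 2.0])   # illustrative weights
x = np.array([1.0, 2.0, 0.5])    # illustrative input
b = 0.5                          # scalar bias
out = neuron(w, x, b)            # w.x + b = 0, so sigmoid gives 0.5
```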

1.2 Neural network

Multiple neurons form a single-layer neural network:

With multiple neurons, the weights change from a vector to a matrix (3*3), and the bias b changes from a scalar to a vector (b1, b2, b3).
Stack single-layer neural networks to get a multi-layer neural network:

Starting from the input, we can compute the result of each layer in turn. The result of each layer is the previous layer's result passed through a linear transformation and an activation function.
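
A minimal sketch of this layer-by-layer computation, assuming three neurons per layer (matching the 3*3 weights above), a tanh activation, and random weights chosen only for illustration:

```python
import numpy as np

def layer(W, b, h, f=np.tanh):
    # One layer: linear transformation W h + b followed by the activation f
    return f(W @ h + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input vector
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)   # 3x3 weights, bias (b1, b2, b3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)

h1 = layer(W1, b1, x)    # result of the first layer
h2 = layer(W2, b2, h1)   # second layer consumes the first layer's result
```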

1.3 Activation function

Why use f? Why activate with a nonlinear function?
- Suppose the network contains only linear transformations. After two layers, we find that h2 can be obtained from the initial input with a single linear transformation.
- So a multi-layer network would have the same expressive power as a single layer. To prevent the network from collapsing in this way, to increase its expressive power, and to fit more complex functions, we introduce nonlinearity into the network.
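
The collapse can be checked numerically. In this sketch (random weights, chosen only for illustration) two linear layers without an activation give the same result as one linear layer with W = W2 W1 and b = W2 b1 + b2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)

# Two linear layers with no activation in between
h1 = W1 @ x + b1
h2 = W2 @ h1 + b2

# The same result from a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
collapsed = np.allclose(h2, W @ x + b)
```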
Common nonlinear activation functions
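
The list from the original notes is not reproduced here. As an assumption, three widely used activations (sigmoid, tanh, ReLU) are sketched below; the lecture may have shown others.

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Maps any real value into (-1, 1)
    return np.tanh(z)

def relu(z):
    # Keeps positive values, sets negative values to 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])   # sample inputs
```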

1.4 Output layer

Choose the output layer according to the form of the output:
- Linear output
  - Add a linear layer after the hidden layer and output its result directly.
  - Used for regression problems.
- Sigmoid
  - First use an ordinary linear layer to get a value, then apply the sigmoid activation function to squash the output into the range 0 to 1.
  - Suitable for binary classification problems.
- Softmax
  - First apply a linear layer to the last hidden layer to get an output z, then substitute it into the function $y_{i}=\operatorname{softmax}(z)_{i}=\frac{\exp \left(z_{i}\right)}{\sum_{j} \exp \left(z_{j}\right)}$ .
  - Purpose: the exponential makes every term positive, which handles negative values of z; dividing by the sum makes the outputs over all classes sum to 1, giving a probability distribution over the classes.
  - Often used for multi-class classification problems.
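
A small sketch of the softmax output layer. Subtracting max(z) before exponentiating is a standard numerical-stability step that is not mentioned in the notes; it does not change the result.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) leaves the result unchanged but avoids overflow in exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])   # illustrative linear-layer output, including a negative value
y = softmax(z)                   # all entries positive, summing to 1
```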

2 Training

2.1 Training objectives:

Regression objective: minimize the mean squared error (for regression problems)
Classification objective: minimize the cross entropy

If the correct answer is the first class, the cross entropy works out to 0.74; if the correct answer is the second class, the cross entropy is 1.74; if the correct answer is the third class ……
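
Both objectives can be sketched as below. The predicted distribution (0.6, 0.3, 0.1) and the base-2 logarithm are assumptions: they are not taken from the original figure, which is not available here, and were chosen only because they happen to reproduce the quoted values 0.74 and 1.74.

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean squared error between predictions and targets
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def cross_entropy(probs, correct_index):
    # With a one-hot target, cross entropy reduces to -log p(correct class).
    # The base-2 logarithm is an assumption (see the note above).
    return float(-np.log2(probs[correct_index]))

probs = np.array([0.6, 0.3, 0.1])     # hypothetical predicted distribution
ce_first = cross_entropy(probs, 0)    # correct answer is the first class
ce_second = cross_entropy(probs, 1)   # correct answer is the second class
```

The less probability the model assigns to the correct class, the larger the loss.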

2.2 How to update

The idea of gradient descent:
- We reduce the loss function a little at a time.
- At each step, compute the gradient of the loss function with respect to the parameters; it gives the direction in which the loss changes fastest. Because we want the minimum, we step in the negative gradient direction, where the loss decreases fastest.
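
A minimal sketch of this update loop on a one-parameter loss. The quadratic loss, the starting point, and the learning rate are illustrative assumptions.

```python
def loss(theta):
    # Toy loss with its minimum at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):
    # Derivative of the toy loss with respect to theta
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter value
lr = 0.1      # learning rate, an illustrative choice
history = [loss(theta)]
for _ in range(50):
    theta = theta - lr * grad(theta)   # step against the gradient
    history.append(loss(theta))
```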
Gradient descent:
- For a single input (which can be regarded as a one-dimensional parameter), take the derivative.
- For n inputs, take the partial derivative with respect to each input; together they form the gradient matrix.
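
To make the n-input case concrete, the sketch below estimates each partial derivative with a central finite difference and compares it with the analytic gradient. The example function (a sum of squares) is an assumption chosen for illustration.

```python
import numpy as np

def f(x):
    # Example function with n inputs and one output
    return np.sum(x ** 2)

def numerical_gradient(f, x, eps=1e-6):
    # One partial derivative per input, estimated by central differences
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

x = np.array([1.0, -2.0, 3.0])
g = numerical_gradient(f, x)   # close to the analytic gradient 2 * x
```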
Techniques for gradient descent:
- Chain rule
- Back propagation algorithm
  - Forward propagation computes values in the direction the edges of the computation graph point; each edge passes a value along.
  - To find the gradient of the final output with respect to an input value, we compute in the opposite direction.
  - Taking a single node as an example:
    - Multiply the upstream gradient by the local gradient to get the downstream gradient; repeating this step gives the gradients further downstream.
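
The single-node rule can be traced by hand on a tiny graph. The function f = (x + y) * z and the input values are assumptions chosen for illustration, not the example from the lecture.

```python
# Forward pass through f = (x + y) * z
x, y, z = 2.0, -1.0, 3.0
q = x + y    # intermediate node
f = q * z    # final output

# Backward pass: downstream gradient = upstream gradient * local gradient
df_df = 1.0           # gradient of the output with respect to itself
df_dq = df_df * z     # local gradient of q * z with respect to q is z
df_dz = df_df * q     # local gradient of q * z with respect to z is q
df_dx = df_dq * 1.0   # local gradient of x + y with respect to x is 1
df_dy = df_dq * 1.0   # local gradient of x + y with respect to y is 1
```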

3 Word representation: Word2Vec

3.1 Sliding window: a fixed-size sliding window

When the window reaches one end of the sentence, it contains only the target and the context words on one side.
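
A sketch of the sliding window, including the truncation at the ends of the sentence. The helper name `sliding_windows`, the radius, and the example sentence are assumptions made for illustration.

```python
def sliding_windows(tokens, radius=2):
    # For each position, return (target, context words within `radius` on each side)
    windows = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - radius):i]     # empty at the start of the sentence
        right = tokens[i + 1:i + 1 + radius]    # empty at the end of the sentence
        windows.append((target, left + right))
    return windows

sentence = "never too late to learn".split()   # assumed example sentence
windows = sliding_windows(sentence, radius=2)
```

For the first and last words, the context comes from one side only.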

3.2 CBOW: predict the target from the context

Represent never and late as one-hot vectors and average the two to get an averaged word vector, then project that word vector to the size of the vocabulary, and finally apply softmax to get the probability distribution.
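
A minimal forward pass for this step. The toy vocabulary, the embedding size, the random weights, and the matrix names `W_in` and `W_out` are assumptions; multiplying the averaged one-hot vector by `W_in` is the same as averaging the two word vectors.

```python
import numpy as np

vocab = ["never", "too", "late", "to", "learn"]   # toy vocabulary (assumed)
V, d = len(vocab), 4                               # vocabulary size, embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))    # input embeddings, one row per word
W_out = rng.normal(size=(d, V))   # projection back to the vocabulary size

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

context = ["never", "late"]
avg = np.mean([one_hot(w) for w in context], axis=0)   # average of the one-hot vectors
hidden = avg @ W_in                                     # averaged word vector, size d
probs = softmax(hidden @ W_out)                         # distribution over the vocabulary
```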

3.3 Skip-gram: predict the context from the target

Because predicting several results at once is too hard for the model, we decompose the task and predict the context words one at a time.
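
The decomposition can be pictured as turning each window into separate (target, context) training pairs. The helper name `skipgram_pairs`, the radius, and the example sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, radius=1):
    # Emit one (target, context) pair per context word, so each pair
    # is a separate single-word prediction task
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - radius), min(len(tokens), i + radius + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("never too late to learn".split(), radius=1)
```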

3.4 Improvements:

3.4.1 Disadvantage: with a full softmax and a large vocabulary, back propagation and gradient descent become slow.

3.4.2 Two ways to improve computational efficiency:

Negative sampling
- Only a small subset of words is sampled, according to word frequency.
  $P\left(w_{i}\right)=\frac{f\left(w_{i}\right)^{3 / 4}}{\sum_{j=1}^{\mathbb{V}} f\left(w_{j}\right)^{3 / 4}}$
- The exponent 3/4 is an empirical value; it slightly raises the sampling probability of low-frequency words.
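
A sketch of the sampling distribution above. The word frequencies are hypothetical, and drawing negatives with `numpy.random.Generator.choice` is one possible way to sample, not necessarily what a given word2vec implementation does.

```python
import numpy as np

def negative_sampling_probs(freqs, power=0.75):
    # P(w_i) is proportional to f(w_i) ** (3/4)
    weighted = np.asarray(freqs, dtype=float) ** power
    return weighted / weighted.sum()

freqs = np.array([0.9, 0.09, 0.01])   # hypothetical word frequencies
p = negative_sampling_probs(freqs)

rng = np.random.default_rng(0)
negatives = rng.choice(len(freqs), size=5, p=p)   # indices of sampled negative words
```

Compared with the raw frequencies, the rarest word gains probability and the most common word loses some.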
Hierarchical softmax

3.4.3 Other tips:

Sub-sampling: balance common words against rare words
- Common words appear frequently but carry relatively little semantic information; rare words are the opposite.
  $1-\sqrt{t / f(w)}$
- This is the probability of removing word w, where f(w) is its frequency and t is a threshold. The more frequently a word appears, the more likely it is to be removed.
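
The removal probability can be computed directly. The threshold t = 1e-5 and the clipping at 0 for words rarer than t are assumptions; the notes do not give a value for t.

```python
import math

def discard_prob(freq, t=1e-5):
    # 1 - sqrt(t / f(w)), clipped at 0 for words rarer than the threshold t
    return max(0.0, 1.0 - math.sqrt(t / freq))

p_common = discard_prob(1e-2)   # a frequent word is removed most of the time
p_rare = discard_prob(1e-5)     # a word exactly at the threshold is never removed
```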
Soft sliding window
- Words that are farther from the target should be given less weight.
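
The notes do not say how this weighting is achieved. One common approach, assumed here, is to draw the window radius at random for each target, so that nearby words are included more often than distant ones. The helper name `soft_window_context` is hypothetical.

```python
import random

def soft_window_context(tokens, i, max_radius=5, rng=random):
    # Draw the radius uniformly from 1..max_radius (inclusive), so a word at
    # distance k is included with probability (max_radius - k + 1) / max_radius
    radius = rng.randint(1, max_radius)
    left = tokens[max(0, i - radius):i]
    right = tokens[i + 1:i + 1 + radius]
    return left + right

tokens = "never too late to learn".split()   # assumed example sentence
context = soft_window_context(tokens, 2, max_radius=2, rng=random.Random(0))
```

The immediate neighbours of the target are always included, while words two positions away appear only when the larger radius is drawn.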