2022 Tsinghua summer school notes L2_1: Basic components of a neural network
2022-07-24 21:33:00 【The duck neck is gone】
2022 Tsinghua University large model cross Seminar
L2 Neural Network basics
1 Basic components of a neural network
1.1 Neuron
- A single neuron:

Take the dot product of the weight vector and the input vector to get a scalar, add the bias b (a scalar), and pass the result through the nonlinear activation function f to get the output.
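The computation of a single neuron can be sketched in a few lines of NumPy. This is only an illustration: the function name `neuron`, the choice of tanh as the activation f, and the numbers are assumptions, not taken from the lecture.

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Single neuron: activation applied to (w . x + b)."""
    return f(np.dot(w, x) + b)

x = np.array([1.0, 2.0, 3.0])     # input vector (made-up values)
w = np.array([0.5, -0.25, 0.0])   # weight vector (made-up values)
b = 0.0                           # scalar bias

y = neuron(x, w, b)               # w . x + b = 0.0, and tanh(0.0) = 0.0
```

The output `y` is a single scalar, matching the description above.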
1.2 Neural network
- Multiple neurons form a single-layer neural network:

With multiple neurons, the weights change from a vector to a matrix (3×3 in this example) and the bias b changes from a scalar to a vector (b1, b2, b3).
- Stacking single-layer neural networks gives a multi-layer neural network:

Starting from the input, we can compute the result of each layer in turn: each layer's result is the previous layer's result passed through a linear transformation and an activation function.
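As a rough sketch of the layer-by-layer computation, the snippet below stacks two 3×3 layers. The helper `layer`, the tanh activation, and the random parameters are assumptions chosen for illustration.

```python
import numpy as np

def layer(h, W, b, f=np.tanh):
    """One layer: linear transformation W h + b followed by the activation f."""
    return f(W @ h + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)  # input vector

# Two layers, each with a 3x3 weight matrix and a length-3 bias vector.
params = [(rng.normal(size=(3, 3)), rng.normal(size=3)) for _ in range(2)]

h = x
for W, b in params:
    h = layer(h, W, b)  # each layer consumes the previous layer's result
```

After the loop, `h` holds the output of the last layer.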
1.3 Activation function
- Why use f? Why activate with a nonlinear function?

- As the figure shows, suppose the network contains only linear transformations. After two layers, h2 can be obtained from the original input with a single linear transformation.
- So without a nonlinearity, a multi-layer network has exactly the same expressive power as a single layer. To keep the network from collapsing into one layer and to let it fit more complex functions, we introduce nonlinear activation functions.
- Common nonlinear activation functions
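The collapse of stacked linear layers can be checked numerically. The sketch below also defines sigmoid and ReLU; the figure listing the activation functions did not survive, so treating sigmoid, tanh, and ReLU as the ones shown is an assumption. All parameter values are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# tanh is available directly as np.tanh.

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)

# Two layers with no activation in between...
h2_linear = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = np.allclose(h2_linear, W @ x + b)

# With a nonlinearity between the layers, this rewriting is no longer available.
h2_nonlinear = W2 @ relu(W1 @ x + b1) + b2
```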

1.4 Output layer
- The output layer depends on the form of output required:
- Linear output
- Add a linear layer after the hidden layer and output its value directly.
- Used for regression problems.
- Sigmoid
- First apply an ordinary linear layer to get a value, then apply the sigmoid activation function to squash the output into the range 0 to 1.
- Suitable for binary classification problems.
- Softmax
- First apply a linear layer to the last hidden layer to get an output z, then compute $y_{i}=\operatorname{softmax}(z)_{i}=\frac{\exp \left(z_{i}\right)}{\sum_{j} \exp \left(z_{j}\right)}$.
- Purpose: the exponential makes every term positive even when $z_i$ is negative, and the normalization makes the outputs over all classes sum to 1, giving a probability distribution over the classes.
- Commonly used for multi-class classification problems.
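A minimal sketch of the softmax output follows. Subtracting the maximum before exponentiating is a common numerical-stability step that is not mentioned in the notes; it does not change the result. The values in `z` are made up.

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shifting by max(z) leaves the ratio unchanged
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])  # output of the last linear layer (made-up values)
y = softmax(z)                  # all entries positive, summing to 1
```

Even though `z` contains a negative entry, every entry of `y` is positive, which is the property described above.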
2 Training
2.1 Training objectives :
- Regression objective: minimize the mean squared error (for regression problems)
- Classification objective: minimize the cross entropy

If the correct answer is the first class, the cross entropy works out to 0.74; if the correct answer is the second class, the cross entropy is 1.74; if the correct answer is the third class ……
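Both objectives can be sketched as follows. The figure with the predicted distribution is missing, so the probabilities (0.6, 0.3, 0.1) and the base-2 logarithm are assumptions: they were chosen only because they reproduce the values 0.74 and 1.74 quoted above.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error for regression."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(diff ** 2)

def cross_entropy(probs, correct_index, log=np.log2):
    """Cross entropy against a one-hot target: -log p(correct class)."""
    return -log(probs[correct_index])

probs = np.array([0.6, 0.3, 0.1])     # assumed predicted distribution

ce_first = cross_entropy(probs, 0)    # about 0.74 if class 1 is correct
ce_second = cross_entropy(probs, 1)   # about 1.74 if class 2 is correct
```

The lower the probability assigned to the correct class, the larger the loss.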
2.2 How to update
- Concept of gradient descent:

- We reduce the loss function a little at each step.
- At each step, compute the gradient of the loss function with respect to the parameters. The gradient points in the direction in which the loss increases fastest; because we want to minimize the loss, we move the parameters in the opposite direction, the negative gradient.
- Gradient descent:
- For a single input (which can be regarded as a one-dimensional parameter), take the partial derivative.
- For n inputs, as in the figure below, the partial derivatives are collected into the resulting gradient matrix.
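The update rule can be illustrated on a toy problem. The quadratic loss, the learning rate `lr`, and the number of steps are assumptions chosen so that the example converges quickly; they are not from the lecture.

```python
import numpy as np

target = np.array([3.0, -1.0])  # minimizer of the toy loss

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    # Gradient of the loss with respect to the parameters theta.
    return 2.0 * (theta - target)

theta = np.zeros(2)  # initial parameters
lr = 0.1             # learning rate (step size)

for _ in range(100):
    theta = theta - lr * grad(theta)  # step against the gradient
```

Each step shrinks the distance to the minimizer by a constant factor, so `theta` ends up very close to `target`.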

- Tricks for gradient descent:
- Chain rule
- Backpropagation algorithm
- Forward propagation follows the direction in which the edges point; the role of an edge is to pass values along.

- To find the gradient of the final output with respect to an input value, we compute in the opposite direction.
- Taking one segment as an example, here is how a single node is computed:

- Multiply the upstream gradient by the local gradient to get the downstream gradient; repeating this, we can keep computing gradients further downstream.
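The rule "downstream gradient = upstream gradient × local gradient" can be traced by hand on a tiny graph. The function f = (x + y) · z and the input values are an assumed example, not necessarily the one in the lecture's figure.

```python
# Forward pass: values flow along the edges.
x, y, z = -2.0, 5.0, -4.0
q = x + y   # add node
f = q * z   # multiply node (final output)

# Backward pass: start from df/df = 1 and move against the edges.
grad_f = 1.0
grad_q = grad_f * z    # upstream * local gradient (df/dq = z)
grad_z = grad_f * q    # upstream * local gradient (df/dz = q)
grad_x = grad_q * 1.0  # local gradient of q = x + y with respect to x is 1
grad_y = grad_q * 1.0  # local gradient of q = x + y with respect to y is 1
```

Here `grad_q` acts as the upstream gradient for the add node, which is how the computation continues further downstream.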
3 Word representation: Word2Vec
3.1 Sliding window: a fixed-size sliding window
When the window reaches one end of the sentence, the target word only has context words on one side.
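A sliding window over a tokenized sentence might look like the sketch below. The helper `sliding_windows`, the `radius` parameter (number of context words on each side), and the example sentence are assumptions.

```python
def sliding_windows(tokens, radius=1):
    """Yield (target, context) for each position of a fixed-size window."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - radius):i]     # shorter at the start of the sentence
        right = tokens[i + 1:i + 1 + radius]    # shorter at the end of the sentence
        yield target, left + right

tokens = "never too late to learn".split()  # assumed example sentence
windows = list(sliding_windows(tokens))
```

In this sketch the first and last windows contain a single context word, because the window is cut off at the sentence boundary.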
3.2 CBOW: predict the target from the context
Represent the context words never and late as one-hot vectors, map them to word vectors and average the two, project the averaged word vector to the size of the vocabulary, and finally apply softmax to get the probability distribution over the target word.
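The CBOW forward pass can be sketched as below. The toy vocabulary, the embedding size `d`, and the randomly initialized matrices `W_in` and `W_out` are assumptions for illustration; a real model would learn these matrices during training.

```python
import numpy as np

vocab = ["never", "too", "late", "to", "learn"]  # assumed toy vocabulary
index = {word: i for i, word in enumerate(vocab)}
V, d = len(vocab), 4                             # vocabulary size, embedding size

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))    # one-hot -> word vector
W_out = rng.normal(scale=0.1, size=(d, V))   # word vector -> vocabulary scores

def one_hot(word):
    v = np.zeros(V)
    v[index[word]] = 1.0
    return v

def cbow_probs(context):
    word_vectors = [one_hot(w) @ W_in for w in context]
    hidden = np.mean(word_vectors, axis=0)   # average the context word vectors
    scores = hidden @ W_out                  # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()     # softmax over the vocabulary

p = cbow_probs(["never", "late"])
predicted = vocab[int(p.argmax())]
```

With untrained random weights the prediction is arbitrary; training would push the probability of the true target word up.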
3.3 Skip-gram: predict the context from the target
- Predicting several results at once is too hard for the model, so we decompose the task and predict the context words one at a time.
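The decomposition into one-at-a-time predictions amounts to turning each window into separate (target, context) training pairs. The helper `skipgram_pairs`, the `radius` parameter, and the example sentence below are assumptions.

```python
def skipgram_pairs(tokens, radius=1):
    """Split each window into individual (target, context) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - radius)
        stop = min(len(tokens), i + radius + 1)
        for j in range(start, stop):
            if j != i:  # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("never too late to learn".split())
```

For the target "too", this produces the two separate pairs ("too", "never") and ("too", "late") instead of one joint prediction.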
3.4 Improvements:
3.4.1 Drawback: with a full softmax and a large vocabulary, backpropagation and gradient descent become slow.
3.4.2 Two ways to improve computational efficiency:
- Negative sampling
- Only a small number of words are sampled, according to word frequency.
- $P\left(w_{i}\right)=\frac{f\left(w_{i}\right)^{3 / 4}}{\sum_{j=1}^{\mathbb{V}} f\left(w_{j}\right)^{3 / 4}}$
- The exponent 3/4 is an empirical value that slightly raises the sampling probability of low-frequency words.
- Hierarchical softmax
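The effect of the 3/4 exponent in negative sampling can be seen on made-up counts. The function name `negative_sampling_probs` and the counts are assumptions; raw counts can stand in for the frequencies f(w) because the normalization cancels the common factor.

```python
import numpy as np

def negative_sampling_probs(counts, power=0.75):
    """P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

counts = np.array([900, 90, 10])    # made-up word counts
raw = counts / counts.sum()         # plain relative frequencies
smoothed = negative_sampling_probs(counts)

# Draw a handful of negative samples (as word indices) from the smoothed distribution.
rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=smoothed)
```

Compared with `raw`, the rarest word gets a higher probability in `smoothed` and the most common word a lower one.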
3.4.3 Other tips:
- Sub-sampling: balance common words against rare words
- Common words appear frequently but carry little semantic information; rare words are the opposite.
- A word $w$ is removed with probability $1-\sqrt{t / f(w)}$, where $f(w)$ is the word's frequency and $t$ is a threshold: the more frequently a word appears, the more likely it is to be removed.
- Soft sliding window
- Words farther from the target should be given less weight.
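The sub-sampling rule above can be evaluated directly. The threshold value `t=1e-5` and the clipping at 0 for words rarer than the threshold are assumptions; the notes give only the formula.

```python
import numpy as np

def discard_prob(freq, t=1e-5):
    """Probability of removing a word with relative frequency freq: 1 - sqrt(t / freq)."""
    return np.maximum(0.0, 1.0 - np.sqrt(t / freq))  # clipped at 0 (assumption)

p_common = discard_prob(1e-2)  # a very frequent word is removed most of the time
p_rare = discard_prob(1e-5)    # a word at the threshold frequency is never removed
```

With these assumed numbers, the common word is dropped roughly 97% of the time while the rare word is always kept.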